https://arxiv.org/pdf/2510.11718.pdf
Authors:Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu
Title: CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
Abstract:
Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet these models still face a critical bottleneck on problems requiring visual assistance, such as drawing auxiliary lines or plotting functions. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the precision and controllability such tasks demand. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as "visual thoughts", to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized in parsing complex mathematical figures into code. Finally, we use these data to train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of the proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
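To make the paradigm concrete, here is a minimal sketch of the code-as-visual-thought loop, assuming (hypothetically) that the model wraps plotting code in <plot>...</plot> tags; the paper's actual prompt format and execution sandbox may differ.

```python
# Minimal sketch: extract plotting code from model output, render it, and
# hand the image back as a "visual thought". The <plot> tag convention and
# the generation step are assumptions, not CodePlot-CoT's exact interface.
import re
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def extract_plot_code(model_output):
    """Pull the first <plot> block out of the model's reasoning text."""
    m = re.search(r"<plot>(.*?)</plot>", model_output, re.DOTALL)
    return m.group(1) if m else None

def render_visual_thought(code, path="thought.png"):
    """Execute the plotting code and rasterize the figure as a 'visual thought'."""
    namespace = {"plt": plt}
    exec(code, namespace)  # trusted-code assumption; sandbox this in practice
    plt.savefig(path)
    plt.close("all")
    return path

reasoning = (
    "To see where the parabola meets the line, I plot both:\n"
    "<plot>\n"
    "import numpy as np\n"
    "x = np.linspace(-3, 3, 200)\n"
    "plt.plot(x, x**2, label='y = x^2')\n"
    "plt.plot(x, x + 2, label='y = x + 2')\n"
    "plt.legend()\n"
    "</plot>\n"
)
code = extract_plot_code(reasoning)
if code:
    image_path = render_visual_thought(code)  # fed back to the VLM as an image
```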

Authors:Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi
Title: DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Abstract:
In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 comprises several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, including circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.
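As an illustration of one token-level module, the sketch below shows circular padding along the longitude axis of an equirectangular feature map; the pad width and placement are assumptions, not DiT360's exact configuration.

```python
# Circular padding for boundary continuity: in an equirectangular panorama
# the left and right edges are physically adjacent, so we wrap them before
# convolution instead of zero-padding.
import torch
import torch.nn.functional as F

def circular_pad_lr(x, pad=4):
    """Wrap left/right edges so convolutions see a seamless 360-degree horizon.
    x: (B, C, H, W); padding is circular only along the width (longitude) axis."""
    return F.pad(x, (pad, pad, 0, 0), mode="circular")

x = torch.randn(1, 3, 256, 512)              # equirectangular feature map
y = circular_pad_lr(x)                       # (1, 3, 256, 520)
assert torch.equal(y[..., :4], x[..., -4:])  # left pad copied from the right edge
```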

Authors:Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
Title: QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Abstract:
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts the noise during training. Experiments demonstrate that QeRL delivers over 1.5x speedup in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with a 7B model. These results establish QeRL as an efficient and effective framework for RL training of LLMs.
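The sketch below illustrates the intuition behind quantization noise as exploration, with a hypothetical uniform fake-quantizer and a linearly decaying noise schedule standing in for NVFP4 and AQN; it is not QeRL's exact mechanism.

```python
# Quantization error behaves like exploration noise on the policy; annealing
# its scale keeps exploration strong early in RL training and weak later.
import torch

def fake_quantize(w, num_levels=16):
    """Uniform fake-quantization: round weights onto a coarse grid."""
    scale = w.abs().max() / (num_levels / 2)
    return torch.round(w / scale) * scale

def noisy_quantized_weights(w, step, total_steps, sigma0=1e-2):
    """Quantize, then add Gaussian noise whose scale decays linearly,
    so exploration is strongest at the start of training."""
    sigma = sigma0 * (1.0 - step / total_steps)
    return fake_quantize(w) + sigma * torch.randn_like(w)

w = torch.randn(128, 128)
w_explore = noisy_quantized_weights(w, step=100, total_steps=1000)
```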

Authors:Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong
Title: Scaling Language-Centric Omnimodal Representation Learning
Abstract:
Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons for their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from the implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space to generate unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capability gained through contrastive refinement scales positively with the MLLM's generative capability. This suggests that improving generative abilities is an effective paradigm for enhancing representation quality. We provide a theoretical explanation of the GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance a model's embedding potential. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
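As a sketch of the lightweight contrastive-refinement stage, the snippet below implements a standard symmetric InfoNCE loss over paired embeddings; the pooling, temperature, and batch construction are assumptions rather than LCO-Emb's exact recipe.

```python
# Symmetric InfoNCE: each text embedding should be closest to its paired
# other-modality embedding within the batch, and vice versa.
import torch
import torch.nn.functional as F

def info_nce(text_emb, other_emb, tau=0.05):
    """Symmetric InfoNCE over a batch of positive pairs (i-th text <-> i-th item)."""
    t = F.normalize(text_emb, dim=-1)
    o = F.normalize(other_emb, dim=-1)
    logits = t @ o.T / tau                    # (B, B) cosine-similarity logits
    labels = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = info_nce(torch.randn(32, 768), torch.randn(32, 768))
```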

Authors:Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, Shuicheng Yan
Title: IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
Abstract:
Instruction-guided video editing has emerged as a rapidly advancing research direction, offering new opportunities for intuitive content transformation while also posing significant challenges for systematic evaluation. Existing video editing benchmarks fail to adequately support the evaluation of instruction-guided editing and further suffer from limited source diversity, narrow task coverage, and incomplete evaluation metrics. To address these limitations, we introduce IVEBench, a modern benchmark suite specifically designed for instruction-guided video editing assessment. IVEBench comprises a diverse database of 600 high-quality source videos spanning seven semantic dimensions and covering video lengths from 32 to 1,024 frames. It further includes 8 categories of editing tasks with 35 subcategories, whose prompts are generated and refined through large language models and expert review. Crucially, IVEBench establishes a three-dimensional evaluation protocol encompassing video quality, instruction compliance, and video fidelity, integrating both traditional metrics and multimodal large language model-based assessments. Extensive experiments demonstrate the effectiveness of IVEBench in benchmarking state-of-the-art instruction-guided video editing methods, showing its ability to provide comprehensive and human-aligned evaluation outcomes.

Authors:Feng Zhang, Haoyou Deng, Zhiqiang Li, Lida Li, Bin Xu, Qingbo Lu, Zisheng Cao, Minchen Wei, Changxin Gao, Nong Sang, Xiang Bai
Title: High-resolution Photo Enhancement in Real-time: A Laplacian Pyramid Network
Abstract:
Photo enhancement plays a crucial role in augmenting the visual aesthetics of a photograph. In recent years, photo enhancement methods have either focused on enhancement performance, producing powerful models that cannot be deployed on edge devices, or prioritized computational efficiency, resulting in inadequate performance for real-world applications. To this end, this paper introduces a pyramid network called LLF-LUT++, which integrates global and local operators through closed-form Laplacian pyramid decomposition and reconstruction. This approach enables fast processing of high-resolution images while achieving excellent performance. Specifically, we utilize an image-adaptive 3D LUT that capitalizes on the global tonal characteristics of downsampled images, while incorporating two distinct weight fusion strategies to achieve coarse global image enhancement. To implement this strategy, we design a spatial-frequency transformer weight predictor that extracts the desired distinct weights by leveraging frequency features. Additionally, we apply local Laplacian filters to adaptively refine edge details in high-frequency components. After meticulously redesigning the network structure and transformer model, LLF-LUT++ not only achieves a 2.64 dB improvement in PSNR on the HDR+ dataset but also further reduces runtime, processing 4K-resolution images in just 13 ms on a single GPU. Extensive experimental results on two benchmark datasets further show that the proposed approach performs favorably against state-of-the-art methods. The source code will be made publicly available at https://github.com/fengzhang427/LLF-LUT.
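The closed-form Laplacian pyramid at the heart of this design can be sketched as follows; cv2's pyrDown/pyrUp stand in for the decomposition, and the LUT and local-filter enhancement operators themselves are omitted.

```python
# Laplacian pyramid: the low-frequency residual is where global (LUT-style)
# enhancement would act; the high-frequency bands carry edge detail for
# local refinement. Decomposition and reconstruction are exact inverses.
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Decompose into `levels` band-pass images plus a low-frequency residual."""
    bands, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        bands.append(cur - up)    # high-frequency band at this scale
        cur = down
    return bands, cur             # bands (fine to coarse) + low-freq residual

def reconstruct(bands, residual):
    """Exact inverse: upsample and add the bands back, coarse to fine."""
    cur = residual
    for band in reversed(bands):
        cur = cv2.pyrUp(cur, dstsize=(band.shape[1], band.shape[0])) + band
    return cur

img = np.random.rand(256, 256, 3).astype(np.float32)
bands, res = laplacian_pyramid(img)
out = reconstruct(bands, res)
assert np.allclose(out, img, atol=1e-4)  # closed-form round trip
```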

Authors:Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang
Title: ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

Authors:Leonard Bruns, Axel Barroso-Laguna, Tommaso Cavallari, Áron Monszpart, Sowmya Munukutla, Victor Adrian Prisacariu, Eric Brachmann
Title: ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
Abstract:
Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from mapping images to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.

Authors:Hongyu Zhu, Lin Chen, Mounim A. El-Yacoubi, Mingsheng Shang
Title: MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
Abstract:
Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix are: (1) a Sentiment-Aware Sample Selection (SASS) strategy that prevents the semantic confusion caused by mixing samples with contradictory emotions; (2) a Sentiment Intensity Guided (SIG) module that uses multi-head self-attention to dynamically compute modality-specific mixing ratios based on each modality's emotional intensity; and (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities and incorporates a Kullback-Leibler-based regularization term to jointly train the emotion intensity predictor and the backbone network. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
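A minimal sketch of the emotion-aware mixing idea follows; the per-modality ratios here are random Beta draws, whereas MS-Mix derives them from attention over emotional intensities, which is not reproduced.

```python
# Sentiment-aware mixup: only mix samples whose sentiment labels agree,
# with a separate mixing ratio per modality.
import torch

def sentiment_aware_mixup(feats, labels, alpha=0.4):
    """feats: {'text': (B,D), 'audio': (B,D), 'video': (B,D)}; labels: (B,) polarity."""
    B = labels.size(0)
    perm = torch.randperm(B)
    same = labels[perm] == labels                    # emotion-consistent pairs only
    perm = torch.where(same, perm, torch.arange(B))  # else mix a sample with itself
    mixed = {}
    for mod, x in feats.items():
        lam = torch.distributions.Beta(alpha, alpha).sample((B, 1))  # per-modality ratio
        mixed[mod] = lam * x + (1 - lam) * x[perm]
    return mixed, labels  # labels unchanged: mixed pairs share the same sentiment

feats = {m: torch.randn(8, 64) for m in ("text", "audio", "video")}
labels = torch.randint(0, 3, (8,))
mixed, y = sentiment_aware_mixup(feats, labels)
```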

Authors:Kuanning Wang, Yongchong Gu, Yuqian Fu, Zeyu Shangguan, Sicheng He, Xiangyang Xue, Yanwei Fu, Daniel Seita
Title: SCOOP'D: Learning Mixed-Liquid-Solid Scooping via Sim2Real Generative Policy
Abstract:
Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP'D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as "Level 1" and "Level 2." SCOOP'D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page is at https://scoopdiff.github.io/.

Authors:Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang
Title: SNAP: Towards Segmenting Anything in Any Point Cloud
Abstract:
Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted to a single domain (indoor or outdoor) and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 of 9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page: https://neu-vi.github.io/SNAP/

Authors:Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Title: mmWalk: Towards Multi-modal Multi-view Walking Assistance
Abstract:
Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

Authors:Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Marc Pollefeys, Rainer Stiefelhagen
Title: Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model
Abstract:
Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.

Authors:Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu
Title: VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment
Abstract:
3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.

Authors:Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui
Title: Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion
Abstract:
Existing Infrared and Visible Image Fusion (IVIF) methods typically assume high-quality inputs. However, when handling degraded images, these methods rely heavily on manually switching between different pre-processing techniques. This decoupling of degradation handling and image fusion leads to significant performance degradation. In this paper, we propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion), which tightly couples degradation modeling with the fusion process and leverages vision-language models (VLMs) for degradation-aware perception and guided suppression. Specifically, the proposed Specific-Prompt Degradation-Coupled Extractor (SPDCE) enables modality-specific degradation awareness and establishes joint modeling of degradation suppression and intra-modal feature extraction. In parallel, the Joint-Prompt Degradation-Coupled Fusion (JPDCF) facilitates cross-modal degradation perception and couples residual degradation filtering with complementary cross-modal feature fusion. Extensive experiments demonstrate that our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios. Our code is available at https://github.com/Lmmh058/VGDCFusion.

Authors:Yijun Hu, Bing Fan, Xin Gu, Haiqing Ren, Dongfang Liu, Heng Fan, Libo Zhang
Title: Robust Ego-Exo Correspondence with Long-Term Memory
Abstract:
Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego-exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego-exo feature fusion and limited long-term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long-term memories by presenting a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module which consists of a dual-branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy to retain critical long-term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.

Authors:Wenyuan Zhang, Jimin Tang, Weiqi Zhang, Yi Fang, Yu-Shen Liu, Zhizhong Han
Title: MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference
Abstract:
Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Recent approaches enhance Gaussian primitives with reflection-related material attributes to enable physically based rendering (PBR) with Gaussian Splatting. However, the material inference often lacks sufficient constraints, especially under limited environment modeling, resulting in illumination aliasing and reduced generalization. In this work, we revisit the problem from a multi-view perspective and show that multi-view consistent material inference with more physically based environment modeling is key to learning accurate reflections with Gaussian Splatting. To this end, we enforce 2D Gaussians to produce multi-view consistent material maps during deferred shading. We also track photometric variations across views to identify highly reflective regions, which serve as strong priors for reflection strength terms. To handle indirect illumination caused by inter-object occlusions, we further introduce an environment modeling strategy based on ray tracing with 2DGS, enabling photorealistic rendering of indirect radiance. Experiments on widely used benchmarks show that our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel view synthesis.

Authors:Zhao Huang, Boyang Sun, Alexandros Delitzas, Jiaqi Chen, Marc Pollefeys
Title: REACT3D: Recovering Articulations for Interactive Physical 3D Scenes
Abstract:
Interactive 3D scenes are increasingly vital for embodied intelligence, yet existing datasets remain limited due to the labor-intensive process of annotating part segmentation, kinematic types, and motion trajectories. We present REACT3D, a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry, enabling direct use in diverse downstream tasks. Our contributions include: (i) openable-object detection and segmentation to extract candidate movable parts from static scenes, (ii) articulation estimation that infers joint types and motion parameters, (iii) hidden-geometry completion followed by interactive object assembly, and (iv) interactive scene integration in widely supported formats to ensure compatibility with standard simulation platforms. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, demonstrating the effectiveness of our framework and providing a practical foundation for scalable interactive scene generation, thereby lowering the barrier to large-scale research on articulated scene understanding. Our project page is at https://react3d.github.io/.

Authors:Weixuan Li, Quanjun Li, Guang Yu, Song Yang, Zimeng Li, Chi-Man Pun, Yupeng Liu, Xuhang Chen
Title: DTEA: Dynamic Topology Weaving and Instability-Driven Entropic Attenuation for Medical Image Segmentation
Abstract:
In medical image segmentation, skip connections are used to merge global context and reduce the semantic gap between encoder and decoder. Current methods often struggle with limited structural representation and insufficient contextual modeling, affecting generalization in complex clinical scenarios. We propose the DTEA model, featuring a new skip connection framework with the Semantic Topology Reconfiguration (STR) and Entropic Perturbation Gating (EPG) modules. STR reorganizes multi-scale semantic features into a dynamic hypergraph to better model cross-resolution anatomical dependencies, enhancing structural and semantic representation. EPG assesses channel stability after perturbation and filters high-entropy channels to emphasize clinically important regions and improve spatial attention. Extensive experiments on three benchmark datasets show our framework achieves superior segmentation accuracy and better generalization across various clinical settings. The code is available at https://github.com/LWX-Research/DTEA.
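The gating intuition can be sketched as follows; the perturbation, entropy estimate, and keep ratio are illustrative assumptions rather than DTEA's exact design.

```python
# Entropy-based channel gating: perturb the input, measure how unstable each
# channel's response distribution is via entropy, and suppress the most
# unstable channels.
import torch

def entropic_channel_gate(f_clean, f_pert, keep=0.75):
    """f_*: (B, C, H, W) features from clean and perturbed inputs."""
    diff = (f_clean - f_pert).abs().flatten(2)          # (B, C, H*W)
    p = diff / (diff.sum(dim=-1, keepdim=True) + 1e-8)  # normalize to a distribution
    ent = -(p * (p + 1e-8).log()).sum(dim=-1)           # (B, C) channel entropy
    thresh = ent.quantile(keep, dim=1, keepdim=True)
    gate = (ent <= thresh).float()[..., None, None]     # drop the most unstable 25%
    return f_clean * gate

f = torch.randn(2, 64, 32, 32)
f_noisy = f + 0.1 * torch.randn_like(f)                 # stand-in perturbation
gated = entropic_channel_gate(f, f_noisy)
```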

Authors:Rohit Gupta, Anirban Roy, Claire Christensen, Sujeong Kim, Sarah Gerard, Madeline Cincebeaux, Ajay Divakaran, Todd Grindal, Mubarak Shah
Title: Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Abstract:
The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools that enable educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards; for example, literacy codes include 'letter names' and 'letter sounds', and math codes include 'counting' and 'sorting'. We pose this as a fine-grained multi-label classification problem, as videos can contain multiple types of educational content and the content classes can be visually similar (e.g., 'letter names' vs. 'letter sounds'). We propose a novel class-prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a prototype for each class and employ a loss function that minimizes the distances between a class prototype and the samples of that class, while maximizing the distances between the prototype and samples from other classes. As the alignment between visual and audio cues is crucial for effective comprehension, we use a multimodal transformer network to capture the interaction between visual and audio cues while learning the video embedding. For evaluation, we present APPROVE, a dataset of YouTube educational videos labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as YouTube-8M and COIN. The dataset is available at https://github.com/rohit-gupta/MMContrast/tree/main/APPROVE
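A minimal sketch of a class-prototypes contrastive loss for multi-label samples follows; the prototype parametrization and temperature are assumptions, and the paper's exact formulation may differ.

```python
# Pull each sample's embedding toward the prototypes of all its labels and
# push it away from the remaining prototypes.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(z, y_multi, prototypes, tau=0.1):
    """z: (B, D) embeddings; y_multi: (B, K) multi-hot labels;
    prototypes: (K, D) learnable class prototypes."""
    z = F.normalize(z, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.T / tau                      # (B, K) sample-prototype similarity
    log_prob = F.log_softmax(logits, dim=-1)
    # Average log-likelihood over each sample's positive classes.
    pos = (y_multi * log_prob).sum(-1) / y_multi.sum(-1).clamp(min=1)
    return -pos.mean()

K, D = 19, 256                                  # e.g. APPROVE's 19 classes
protos = torch.nn.Parameter(torch.randn(K, D))
loss = prototype_contrastive_loss(torch.randn(8, D),
                                  (torch.rand(8, K) > 0.8).float(), protos)
```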

Authors:Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao
Title: FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Abstract:
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping the model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight, training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. It then selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
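The steering mechanism can be sketched as a forward hook that shifts a middle layer's hidden states along a direction vector; the layer choice, strength, and vector derivation (FlexAC's is hallucination-guided) are assumptions here.

```python
# Activation steering: add a fixed direction to a middle layer's output at
# inference time; larger strength pushes the model toward more associative
# (creative) behavior, zero strength restores the faithful baseline.
import torch

def add_steering_hook(layer, v, strength):
    """Register a forward hook that shifts the layer's output along v."""
    v = v / v.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + strength * v.to(h.dtype)
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return layer.register_forward_hook(hook)

# Usage with a hypothetical HF-style decoder: steer a middle layer.
# handle = add_steering_hook(model.model.layers[16], steer_vec, strength=4.0)
# ... generate ...
# handle.remove()   # detach the hook to restore default behavior
```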

Authors:Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, Yaowei Wang
Title: CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
Abstract:
Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above prior state of the art across both validation and test partitions. Extensive experiments reveal that the quality of the heatmap strongly influences the resulting mask quality, supporting a consistent association between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and predicting masks more precisely. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.

Authors:Ans Munir, Faisal Z. Qureshi, Mohsen Ali, Muhammad Haris Khan
Title: Compositional Zero-Shot Learning: A Survey
Abstract:
Compositional Zero-Shot Learning (CZSL) is a critical task in computer vision that enables models to recognize unseen combinations of known attributes and objects during inference, addressing the combinatorial challenge of requiring training data for every possible composition. This is particularly challenging because the visual appearance of primitives is highly contextual; for example, "small" cats appear visually distinct from "older" ones, and "wet" cars differ significantly from "wet" cats. Effectively modeling this contextuality and the inherent compositionality is crucial for robust compositional zero-shot recognition. This paper presents, to our knowledge, the first comprehensive survey specifically focused on Compositional Zero-Shot Learning. We systematically review state-of-the-art CZSL methods, introducing a taxonomy grounded in disentanglement with four families of approaches: no explicit disentanglement, textual disentanglement, visual disentanglement, and cross-modal disentanglement. We provide a detailed comparative analysis of these methods, highlighting their core advantages and limitations in different problem settings, such as closed-world and open-world CZSL. Finally, we identify the most significant open challenges and outline promising future research directions. This survey aims to serve as a foundational resource to guide and inspire further advancements in this fascinating and important field. The papers studied in this survey, with their official code, are available on our GitHub: https://github.com/ans92/Compositional-Zero-Shot-Learning

Authors:Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen
Title: GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Abstract:
Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks. To this end, we introduce GIR-Bench, a comprehensive benchmark that evaluates unified models from three complementary perspectives. First, we investigate understanding-generation consistency (GIR-Bench-UGC), asking whether models can consistently leverage the same knowledge in both understanding and generation tasks. Second, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I). Third, we evaluate whether models can handle multi-step reasoning in editing (GIR-Bench-Edit). For each subset, we carefully design task-specific evaluation pipelines, enabling fine-grained and interpretable evaluation while mitigating biases from the prevalent MLLM-as-a-Judge paradigm. Extensive ablations over various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at https://hkust-longgroup.github.io/GIR-Bench.

Authors:Ruihang Xu, Dewei Zhou, Fan Ma, Yi Yang
Title: ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation
Abstract:
Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

Authors:Huanjin Yao, Ruifei Zhang, Jiaxing Huang, Jingyi Zhang, Yibo Wang, Bo Fang, Ruolin Zhu, Yongcheng Jing, Shunyu Liu, Guanbin Li, Dacheng Tao
Title: A Survey on Agentic Multimodal Large Language Models
Abstract:
With the recent emergence of revolutionary autonomous agentic systems, the research community is witnessing a significant shift from traditional static, passive, and domain-specific AI agents toward more dynamic, proactive, and generalizable agentic AI. Motivated by the growing interest in agentic AI and its potential trajectory toward AGI, we present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). In this survey, we explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and distinguishing characteristics from conventional MLLM-based agents. We establish a conceptual framework that organizes agentic MLLMs along three fundamental dimensions: (i) agentic internal intelligence, which functions as the system's commander, enabling accurate long-horizon planning through reasoning, reflection, and memory; (ii) agentic external tool invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (iii) agentic environment interaction, which further situates models within virtual or physical environments, allowing them to take actions, adapt strategies, and sustain goal-directed behavior in dynamic real-world scenarios. To further accelerate research in this area, we compile open-source training frameworks as well as training and evaluation datasets for developing agentic MLLMs. Finally, we review the downstream applications of agentic MLLMs and outline future research directions. To continuously track developments in this rapidly evolving field, we actively maintain a public repository at https://github.com/HJYao00/Awesome-Agentic-MLLMs.

Authors:Namhoon Kim, Sara Fridovich-Keil
Title: Towards Distribution-Shift Uncertainty Estimation for Inverse Problems with Generative Priors
Abstract:
Generative models have shown strong potential as data-driven priors for solving inverse problems such as reconstructing medical images from undersampled measurements. While these priors improve reconstruction quality with fewer measurements, they risk hallucinating features when test images lie outside the training distribution. Existing uncertainty quantification methods in this setting (i) require an in-distribution calibration dataset, which may not be available, (ii) provide heuristic rather than statistical estimates, or (iii) quantify uncertainty from model capacity or limited measurements rather than distribution shift. We propose an instance-level, calibration-free uncertainty indicator that is sensitive to distribution shift, requires no knowledge of the training distribution, and incurs no retraining cost. Our key hypothesis is that reconstructions of in-distribution images remain stable under random measurement variations, while reconstructions of out-of-distribution (OOD) images exhibit greater instability. We use this stability as a proxy for detecting distribution shift. Our proposed OOD indicator is efficiently computable for any computational imaging inverse problem; we demonstrate it on tomographic reconstruction of MNIST digits, where a learned proximal network trained only on digit "0" is evaluated on all ten digits. Reconstructions of OOD digits show higher variability and correspondingly higher reconstruction error, validating this indicator. These results suggest a deployment strategy that pairs generative priors with lightweight guardrails, enabling aggressive measurement reduction for in-distribution cases while automatically warning when priors are applied out of distribution.
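A minimal sketch of the stability indicator follows; `forward` and `reconstruct` are hypothetical stand-ins for the measurement operator and the generative-prior solver.

```python
# Stability as an OOD proxy: reconstruct the same object under several random
# measurement realizations and use the cross-trial variance of the
# reconstructions as the distribution-shift indicator.
import numpy as np

def ood_instability(x_true, forward, reconstruct, n_trials=8, noise_std=0.05, seed=0):
    """Mean pixelwise variance of reconstructions across measurement draws."""
    rng = np.random.default_rng(seed)
    recons = []
    for _ in range(n_trials):
        y = forward(x_true)
        y = y + noise_std * rng.standard_normal(y.shape)  # fresh measurement noise
        recons.append(reconstruct(y))                     # solve the inverse problem
    recons = np.stack(recons)
    # In-distribution inputs reconstruct stably; high variance across trials
    # flags that the prior is being applied out of distribution.
    return recons.var(axis=0).mean()

# Toy usage with identity "measurements" and a trivial "solver".
score = ood_instability(np.zeros((28, 28)), forward=lambda x: x,
                        reconstruct=lambda y: y)
```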

Authors:Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang, Xiaoke Huang, Hui Liu, Xianfeng Tang, Zeyu Zheng, Haoqin Tu, Cihang Xie, Yuyin Zhou
Title: Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
Abstract:
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is both challenging and in high demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (500 multiple-choice questions with country-level answers and panoramas) and WhereStreet (310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt final-prediction metrics: location accuracy within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state-of-the-art VLMs with web search tools on EarthWhere and report different types of final-answer accuracy as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We find that web search and reasoning do not guarantee improved performance when visual clues are limited, and that models exhibit regional biases, scoring up to 42.7% higher in some areas than in others. These findings highlight not only the promise but also the persistent challenges models face in mitigating bias and achieving robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.
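The Acc@k metric can be sketched with a standard haversine distance; the Earth radius and threshold values are conventional choices, not necessarily EarthWhere's exact constants.

```python
# Acc@k: the fraction of predictions whose great-circle distance to the
# ground-truth coordinates falls within k kilometres.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    R = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def acc_at_k(preds, gts, k_km):
    hits = sum(haversine_km(*p, *g) <= k_km for p, g in zip(preds, gts))
    return hits / len(preds)

preds = [(48.8566, 2.3522)]           # predicted: Paris
gts = [(48.8049, 2.1204)]             # ground truth: Versailles (~18 km away)
print(acc_at_k(preds, gts, k_km=25))  # 1.0
```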

Authors:Soroush Mehraban, Andrea Iaboni, Babak Taati
Title: FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding
Abstract:
Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.

Authors:Yuan Xu, Zimu Zhang, Xiaoxuan Ma, Wentao Zhu, Yu Qiao, Yizhou Wang
Title: Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality
Abstract:
Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors, such as gaze direction and object interactions, which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and observations of scene context, our framework first identifies potential interaction targets and then forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and a live VR environment demonstrate the effectiveness of our approach, which achieves superior performance across all metrics and enables practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.

Authors:Shelly Golan, Yotam Nitzan, Zongze Wu, Or Patashnik
Title: VLM-Guided Adaptive Negative Prompting for Creative Generation
Abstract:
Creative generation is the synthesis of new, surprising, and valuable samples that reflect user intent yet cannot be envisioned in advance. This task aims to extend human imagination, enabling the discovery of visual concepts that exist in the unexplored spaces between familiar domains. While text-to-image diffusion models excel at rendering photorealistic scenes that faithfully match user prompts, they still struggle to generate genuinely novel content. Existing approaches to enhance generative creativity either rely on interpolation of image features, which restricts exploration to predefined categories, or require time-intensive procedures such as embedding optimization or model fine-tuning. We propose VLM-Guided Adaptive Negative-Prompting, a training-free, inference-time method that promotes creative image generation while preserving the validity of the generated object. Our approach utilizes a vision-language model (VLM) that analyzes intermediate outputs of the generation process and adaptively steers it away from conventional visual concepts, encouraging the emergence of novel and surprising outputs. We evaluate creativity through both novelty and validity, using statistical metrics in the CLIP embedding space. Through extensive experiments, we show consistent gains in creative novelty with negligible computational overhead. Moreover, unlike existing methods that primarily generate single objects, our approach extends to complex scenarios, such as generating coherent sets of creative objects and preserving creativity within elaborate compositional prompts. Our method integrates seamlessly into existing diffusion pipelines, offering a practical route to producing creative outputs that venture beyond the constraints of textual descriptions.

Authors:Yuxiang Luo, Qing Xu, Hai Huang, Yuqi Ouyang, Zhen Chen, Wenting Duan
Title: MSM-Seg: A Modality-and-Slice Memory Framework with Category-Agnostic Prompting for Multi-Modal Brain Tumor Segmentation
Abstract:
Multi-modal brain tumor segmentation is critical for clinical diagnosis, and it requires accurate identification of distinct internal anatomical subregions. While recent prompt-based segmentation paradigms enable interactive experiences for clinicians, existing methods ignore cross-modal correlations and rely on labor-intensive category-specific prompts, limiting their applicability in real-world scenarios. To address these issues, we propose MSM-Seg, a framework for multi-modal brain tumor segmentation. MSM-Seg introduces a novel dual-memory segmentation paradigm that synergistically integrates multi-modal and inter-slice information with efficient category-agnostic prompting for brain tumor understanding. To this end, we first devise a modality-and-slice memory attention (MSMA) module to exploit the cross-modal and inter-slice relationships among the input scans. We then propose a multi-scale category-agnostic prompt encoder (MCP-Encoder) that provides tumor-region guidance for decoding. Moreover, we devise a modality-adaptive fusion decoder (MF-Decoder) that leverages complementary decoding information across modalities to improve segmentation accuracy. Extensive experiments on different MRI datasets demonstrate that our MSM-Seg framework outperforms state-of-the-art methods in multi-modal metastases and glioma tumor segmentation. The code is available at https://github.com/xq141839/MSM-Seg.

Authors:Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren
Title: Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
Abstract:
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.

Authors:Hao Shan, Ruikai Li, Han Jiang, Yizhe Fan, Ziyang Yan, Bohan Li, Xiaoshuai Hao, Hao Zhao, Zhiyong Cui, Yilong Ren, Haiyang Yu
Title: Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping
Abstract:
As one of the fundamental modules in autonomous driving, online high-definition (HD) maps have attracted significant attention due to their cost-effectiveness and real-time capabilities. Since vehicles always cruise in highly dynamic environments, spatial displacement of onboard sensors inevitably causes shifts in real-time HD mapping results, and such instability poses fundamental challenges for downstream tasks. However, existing online map construction models tend to prioritize improving each frame's mapping accuracy, while the mapping stability has not yet been systematically studied. To fill this gap, this paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability, integrated into a unified mean Average Stability (mAS) score. Extensive experiments on 42 models and variants show that accuracy (mAP) and stability (mAS) represent largely independent performance dimensions. We further analyze the impact of key model design choices on both criteria, identifying architectural and training factors that contribute to high accuracy, high stability, or both. To encourage broader focus on stability, we will release a public benchmark. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems. The benchmark toolkit, code, and models will be available at https://stablehdmap.github.io/.

Authors:Chenlong He, Zijing Dong, Min Li, Zhijian Hao, Leilei Huang, Xiaoyang Zeng, Yibo Fan
Title: JND-Guided Light-Weight Neural Pre-Filter for Perceptual Image Coding
Abstract:
Just Noticeable Distortion (JND)-guided pre-filtering is a promising technique for improving the perceptual compression efficiency of image coding. However, existing methods are often computationally expensive, and the field lacks standardized benchmarks for fair comparison. To address these challenges, this paper introduces a twofold contribution. First, we develop and open-source FJNDF-Pytorch, a unified benchmark for frequency-domain JND-guided pre-filters. Second, leveraging this platform, we propose a complete learning framework for a novel, lightweight Convolutional Neural Network (CNN). Experimental results demonstrate that our proposed method achieves state-of-the-art compression efficiency, consistently outperforming competitors across multiple datasets and encoders. In terms of computational cost, our model is exceptionally lightweight, requiring only 7.15 GFLOPs to process a 1080p image, which is merely 14.1% of the cost of a recent lightweight network. Our work presents a robust, state-of-the-art solution that excels in both performance and efficiency, supported by a reproducible research platform. The open-source implementation is available at https://github.com/viplab-fudan/FJNDF-Pytorch.

Authors:Jingchao Wang, Wenlong Zhang, Dingjiang Huang, Hong Wang, Yefeng Zheng
Title: A Simple and Better Baseline for Visual Grounding
Abstract:
Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task with linguistic and visual modalities, a recent line of research focuses on selecting only the language-relevant visual regions for object localization to reduce computational overhead. Despite achieving impressive performance, this approach is performed iteratively over different image scales, and at every iteration linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify the implementation, in this paper, we propose a feature selection-based simple yet effective baseline for visual grounding, called FSVG. Specifically, we directly encapsulate the linguistic and visual modalities into an overall network architecture without complicated iterative procedures, and utilize the language in parallel as guidance to facilitate the interaction between the linguistic and visual modalities for extracting effective visual features. Furthermore, to reduce the computational cost, during visual feature learning, we introduce a similarity-based feature selection mechanism that exploits only language-related visual features for faster prediction. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that the proposed FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
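
As a rough illustration of similarity-based feature selection, the sketch below keeps only the visual tokens whose best cosine similarity to any language token is highest; the keep ratio, shapes, and max-over-words rule are hypothetical stand-ins, not FSVG's actual mechanism.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(vis, lang, keep_ratio=0.5):
    """vis: (N, D) visual tokens; lang: (L, D) language tokens. Keeps the
    visual tokens most similar to the language, discarding the rest so later
    layers attend over fewer, language-relevant tokens."""
    sim = (F.normalize(vis, dim=-1) @ F.normalize(lang, dim=-1).T).max(dim=-1).values
    k = max(1, int(keep_ratio * vis.size(0)))
    idx = sim.topk(k).indices
    return vis[idx], idx

vis, lang = torch.randn(196, 256), torch.randn(12, 256)
kept, idx = select_visual_tokens(vis, lang, keep_ratio=0.25)
print(kept.shape)  # torch.Size([49, 256])
```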

Authors:Binyu Zhao, Wei Zhang, Zhaonian Zou
Title: MCE: Towards a General Framework for Handling Missing Modalities under Imbalanced Missing Rates
Abstract:
Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.

Authors:Yuteng Ye, Zheng Zhang, Qinchuan Zhang, Di Wang, Youjia Zhang, Wenxiao Zhang, Wei Yang, Yuan Liu
Title: Jigsaw3D: Disentangled 3D Style Transfer via Patch Shuffling and Masking
Abstract:
Controllable 3D style transfer seeks to restyle a 3D asset so that its textures match a reference image while preserving the integrity and multi-view consistency. The prevalent methods either rely on direct reference style token injection or score-distillation from 2D diffusion models, which incurs heavy per-scene optimization and often entangles style with semantic content. We introduce Jigsaw3D, a multi-view diffusion based pipeline that decouples style from content and enables fast, view-consistent stylization. Our key idea is to leverage the jigsaw operation - spatial shuffling and random masking of reference patches - to suppress object semantics and isolate stylistic statistics (color palettes, strokes, textures). We integrate these style cues into a multi-view diffusion model via reference-to-view cross-attention, producing view-consistent stylized renderings conditioned on the input mesh. The renders are then style-baked onto the surface to yield seamless textures. Across standard 3D stylization benchmarks, Jigsaw3D achieves high style fidelity and multi-view consistency with substantially lower latency, and generalizes to masked partial reference stylization, multi-object scene styling, and tileable texture generation. Project page is available at: https://babahui.github.io/jigsaw3D.github.io/
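
The jigsaw operation is straightforward to prototype. A minimal PyTorch sketch, assuming square patches that evenly divide the image; the patch size and mask ratio are illustrative:

```python
import torch

def jigsaw(ref, patch=16, mask_ratio=0.3):
    """ref: (C, H, W) reference style image with H, W divisible by `patch`.
    Shuffles patches to destroy object semantics, then zeroes a random subset,
    leaving mostly stylistic statistics (palette, strokes, texture)."""
    C, H, W = ref.shape
    gh, gw = H // patch, W // patch
    patches = ref.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.reshape(C, gh * gw, patch, patch)
    patches = patches[:, torch.randperm(gh * gw)]      # spatial shuffling
    drop = torch.rand(gh * gw) < mask_ratio
    patches[:, drop] = 0.0                             # random masking
    out = patches.reshape(C, gh, gw, patch, patch).permute(0, 1, 3, 2, 4)
    return out.reshape(C, H, W)

print(jigsaw(torch.rand(3, 256, 256)).shape)           # torch.Size([3, 256, 256])
```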

Authors:Yunlong Deng, Guangyi Chen, Tianpei Gu, Lingjing Kong, Yan Li, Zeyu Tang, Kun Zhang
Title: Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Abstract:
Vision-Language Models (VLMs) integrate visual knowledge with the analytical capabilities of Large Language Models (LLMs) through supervised visual instruction tuning, using image-question-answer triplets. However, the potential of VLMs trained without supervised instruction remains largely unexplored. This study validates that VLMs possess inherent self-refinement capabilities, enabling them to generate high-quality supervised data without external inputs and thereby learn autonomously. Specifically, to stimulate the self-refinement ability of VLMs, we propose a self-refinement framework based on a Triangular Consistency principle: within the image-query-answer triangle, any masked elements should be consistently and accurately reconstructed. The framework involves three steps: (1) We enable the instruction generation ability of VLMs by adding multi-task instruction tuning like image$\rightarrow$question-answer or image-answer$\rightarrow$question. (2) We generate image-query-answer triplets from unlabeled images and use the Triangular Consistency principle for filtering. (3) The model is further updated using the filtered synthetic data. To investigate the underlying mechanisms behind this self-refinement capability, we conduct a theoretical analysis from a causal perspective. Using the widely recognized LLaVA-1.5 as our baseline, our experiments reveal that the model can autonomously achieve consistent, though deliberately modest, improvements across multiple benchmarks without any external supervision, such as human annotations or environmental feedback. We expect that the insights of this study on the self-refinement ability of VLMs can inspire future research on the learning mechanism of VLMs. Code is available at https://github.com/dengyl20/SRF-LLaVA-1.5.

Authors:Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, Xianyuan Zhan
Title: X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Abstract:
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results across a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
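
A minimal sketch of the soft-prompt idea: one learnable embedding sequence per data source, prepended to the token stream before the Transformer encoder. The sizes and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """One learnable prompt per embodiment (data source); sizes hypothetical."""
    def __init__(self, num_embodiments, prompt_len, dim):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_embodiments, prompt_len, dim) * 0.02)

    def forward(self, tokens, embodiment_id):
        # tokens: (B, T, D); embodiment_id: (B,) index of each sample's source.
        # Prepend the matching embodiment-specific prompt to every sequence.
        return torch.cat([self.prompts[embodiment_id], tokens], dim=1)

bank = SoftPromptBank(num_embodiments=9, prompt_len=16, dim=256)
x = torch.randn(4, 64, 256)
ids = torch.randint(0, 9, (4,))
print(bank(x, ids).shape)  # torch.Size([4, 80, 256])
```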

Authors:Linfei Li, Fengyi Zhang, Zhong Wang, Lin Zhang, Ying Shen
Title: INR-Bench: A Unified Benchmark for Implicit Neural Representations in Multi-Domain Regression and Reconstruction
Abstract:
Implicit Neural Representations (INRs) have gained success in various signal processing tasks due to their advantages of continuity and infinite resolution. However, the factors influencing their effectiveness and limitations remain underexplored. To better understand these factors, we leverage insights from Neural Tangent Kernel (NTK) theory to analyze how model architectures (classic MLP and emerging KAN), positional encoding, and nonlinear primitives affect the response to signals of varying frequencies. Building on this analysis, we introduce INR-Bench, the first comprehensive benchmark specifically designed for multimodal INR tasks. It includes 56 variants of Coordinate-MLP models (featuring 4 types of positional encoding and 14 activation functions) and 22 Coordinate-KAN models with distinct basis functions, evaluated across 9 implicit multimodal tasks. These tasks cover both forward and inverse problems, offering a robust platform to highlight the strengths and limitations of different neural models, thereby establishing a solid foundation for future research. The code and dataset are available at https://github.com/lif314/INR-Bench.

Authors:Yulin Wang, Mengting Hu, Hongli Li, Chen Luo
Title: HccePose(BF): Predicting Front \& Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
Abstract:
In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods listed on the BOP website, the proposed approach outperforms them across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.
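
Conceptually, a pixel's front- and back-surface predictions both lie on that pixel's viewing ray, so points interpolated between them project to the same 2D location and can all feed PnP. A hedged OpenCV sketch, where the shapes, solver flag, and linear sampling scheme are illustrative choices rather than the paper's pipeline:

```python
import numpy as np
import cv2

def ultra_dense_pnp(front_xyz, back_xyz, uv, K, n_steps=8):
    """front_xyz, back_xyz: (N, 3) predicted object-frame coordinates of the
    front and back surfaces for N pixels; uv: (N, 2) pixel locations;
    K: (3, 3) camera intrinsics. Interior points sampled between the two
    surfaces reuse the same pixel, multiplying the 2D-3D correspondences."""
    ts = np.linspace(0.0, 1.0, n_steps)[:, None, None]        # (S, 1, 1)
    xyz = (1 - ts) * front_xyz[None] + ts * back_xyz[None]    # (S, N, 3)
    obj = xyz.reshape(-1, 3).astype(np.float64)
    img = np.tile(uv, (n_steps, 1)).astype(np.float64)
    ok, rvec, tvec = cv2.solvePnP(obj, img, K, None, flags=cv2.SOLVEPNP_EPNP)
    return ok, rvec, tvec
```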

Authors:Cristiano Patrício, Luís F. Teixeira, João C. Neves
Title: ViConEx-Med: Visual Concept Explainability via Multi-Concept Token Transformer for Medical Image Analysis
Abstract:
Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications and particularly in high-stakes scenarios, such as medical use-cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at https://github.com/CristianoPatricio/viconex-med.

Authors:Yecong Wan, Mingwen Shao, Renlong Wu, Wangmeng Zuo
Title: Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer
Abstract:
In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. In contrast to existing methods that focus solely on static scenarios and enforce multi-view consistency by averaging color variations, which inevitably sacrifices both chromatic richness and controllability, our approach is able to preserve color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control. Project Page https://yecongwan.github.io/Color3D/.

Authors:Kuangpu Guo, Lijun Sheng, Yongcan Yu, Jian Liang, Zilei Wang, Ran He
Title: Cooperative Pseudo Labeling for Unsupervised Federated Classification
Abstract:
Unsupervised Federated Learning (UFL) aims to collaboratively train a global model across distributed clients without sharing data or accessing label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their powerful zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present promising new opportunities, yet remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, \underline{\textbf{Fed}}erated \underline{\textbf{Co}}operative \underline{\textbf{P}}seudo \underline{\textbf{L}}abeling (\textbf{FedCoPL}). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among classes. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available at \href{https://github.com/krumpguo/FedCoPL}{https://github.com/krumpguo/FedCoPL}.
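
As a toy illustration of the server-side adjustment, the sketch below turns uploaded pseudo-label counts into class weights that push the global distribution toward uniform; FedCoPL's actual redistribution rule may differ.

```python
import numpy as np

def rebalance_pseudo_labels(client_dists):
    """client_dists: per-client pseudo-label count vectors of shape (C,).
    Returns class weights, redistributed to every client, that counteract
    global class imbalance. An illustrative rule, not the paper's."""
    counts = np.stack(client_dists).sum(axis=0)     # aggregate on the server
    target = counts.sum() / len(counts)             # uniform per-class target
    weights = target / np.maximum(counts, 1.0)      # up-weight rare classes
    return [weights.copy() for _ in client_dists]   # send back to clients

dists = [np.array([30.0, 5.0, 1.0]), np.array([2.0, 20.0, 8.0])]
print(rebalance_pseudo_labels(dists)[0])            # rare class gets weight > 1
```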

Authors:Hehe Fan, Yi Yang, Mohan Kankanhalli, Fei Wu
Title: Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling
Abstract:
When modeling a given type of data, we consider it to involve two key aspects: 1) identifying elements (e.g., image pixels or textual words) relevant to a central element, as in a convolutional receptive field, or to a query element, as in self-attention, and 2) encoding these elements effectively. Self-attention can adaptively identify these elements but relies on absolute positional embedding for structural representation learning. In contrast, convolution encodes elements in a relative manner, yet its fixed kernel size limits its ability to adaptively select the relevant elements. In this paper, we introduce Translution, an operation that unifies the adaptive identification capability of self-attention and the relative encoding advantage of convolution. However, this integration leads to a substantial increase in the number of parameters, exceeding most currently available computational resources. Therefore, we propose a lightweight variant of Translution, named α-Translution. Experiments on computer vision and natural language processing tasks show that Translution (including α-Translution) achieves superior accuracy compared to self-attention. The code is available at https://github.com/hehefan/Translution.

Authors:Sitong Gong, Yunzhi Zhuge, Lu Zhang, Pingping Zhang, Huchuan Lu
Title: Complementary and Contrastive Learning for Audio-Visual Segmentation
Abstract:
Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs' limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space. Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer

Authors:Shreshth Saini, Alan C. Bovik, Neil Birkbeck, Yilin Wang, Balu Adsumilli
Title: CHUG: Crowdsourced User-Generated HDR Video Quality Dataset
Abstract:
High Dynamic Range (HDR) videos enhance visual experiences with superior brightness, contrast, and color depth. The surge of User-Generated Content (UGC) on platforms like YouTube and TikTok introduces unique challenges for HDR video quality assessment (VQA) due to diverse capture conditions, editing artifacts, and compression distortions. Existing HDR-VQA datasets primarily focus on professionally generated content (PGC), leaving a gap in understanding real-world UGC-HDR degradations. To address this, we introduce CHUG: Crowdsourced User-Generated HDR Video Quality Dataset, the first large-scale subjective study on UGC-HDR quality. CHUG comprises 856 UGC-HDR source videos, transcoded across multiple resolutions and bitrates to simulate real-world scenarios, totaling 5,992 videos. A large-scale study via Amazon Mechanical Turk collected 211,848 perceptual ratings. CHUG provides a benchmark for analyzing UGC-specific distortions in HDR videos. We anticipate CHUG will advance No-Reference (NR) HDR-VQA research by offering a large-scale, diverse, and real-world UGC dataset. The dataset is publicly available at: https://shreshthsaini.github.io/CHUG/.

Authors:Samanta Rodriguez, Yiming Dou, Miquel Oller, Andrew Owens, Nima Fazeli
Title: Cross-Sensor Touch Generation
Abstract:
Today's visuo-tactile sensors come in many shapes and sizes, making it challenging to develop general-purpose tactile representations. This is because most models are tied to a specific sensor design. To address this challenge, we propose two approaches to cross-sensor image generation. The first is an end-to-end method that leverages paired data (Touch2Touch). The second method builds an intermediate depth representation and does not require paired data (T2D2: Touch-to-Depth-to-Touch). Both methods enable the use of sensor-specific models across multiple sensors via the cross-sensor touch generation process. Together, these models offer flexible solutions for sensor translation, depending on data availability and application needs. We demonstrate their effectiveness on downstream tasks such as in-hand pose estimation and behavior cloning, successfully transferring models trained on one sensor to another. Project page: https://samantabelen.github.io/cross_sensor_touch_generation.

Authors:Atharv Goel, Sharat Agarwal, Saket Anand, Chetan Arora
Title: Reliable Active Learning from Unreliable Labels via Neural Collapse Geometry
Abstract:
Active Learning (AL) promises to reduce annotation cost by prioritizing informative samples, yet its reliability is undermined when labels are noisy or when the data distribution shifts. In practice, annotators make mistakes, rare categories are ambiguous, and conventional AL heuristics (uncertainty, diversity) often amplify such errors by repeatedly selecting mislabeled or redundant samples. We propose Reliable Active Learning via Neural Collapse Geometry (NCAL-R), a framework that leverages the emergent geometric regularities of deep networks to counteract unreliable supervision. Our method introduces two complementary signals: (i) a Class-Mean Alignment Perturbation score, which quantifies how candidate samples structurally stabilize or distort inter-class geometry, and (ii) a Feature Fluctuation score, which captures temporal instability of representations across training checkpoints. By combining these signals, NCAL-R prioritizes samples that both preserve class separation and highlight ambiguous regions, mitigating the effect of noisy or redundant labels. Experiments on ImageNet-100 and CIFAR100 show that NCAL-R consistently outperforms standard AL baselines, achieving higher accuracy with fewer labels, improved robustness under synthetic label noise, and stronger generalization to out-of-distribution data. These results suggest that incorporating geometric reliability criteria into acquisition decisions can make Active Learning less brittle to annotation errors and distribution shifts, a key step toward trustworthy deployment in real-world labeling pipelines. Our code is available at https://github.com/Vision-IIITD/NCAL.
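
A schematic version of the Class-Mean Alignment Perturbation idea: score a candidate by how much folding it into its nearest class mean changes the average inter-class cosine alignment. The exact scoring function below is a hypothetical stand-in for the paper's formulation.

```python
import torch
import torch.nn.functional as F

def class_mean_alignment_perturbation(feats, labels, cand):
    """feats: (N, D) labeled-pool features; labels: (N,); cand: (D,) candidate.
    Positive scores mean the candidate pulls class means closer together
    (distorts inter-class geometry); assumes at least two classes."""
    classes = labels.unique()
    means = torch.stack([feats[labels == c].mean(0) for c in classes])

    def alignment(m):                         # mean off-diagonal |cosine|
        g = F.normalize(m, dim=-1) @ F.normalize(m, dim=-1).T
        off = g - torch.eye(len(m))
        return off.abs().sum() / (len(m) * (len(m) - 1))

    pseudo = torch.cdist(cand[None], means).argmin()   # nearest class mean
    k = int((labels == classes[pseudo]).sum())
    perturbed = means.clone()
    perturbed[pseudo] = (means[pseudo] * k + cand) / (k + 1)
    return (alignment(perturbed) - alignment(means)).item()

feats, labels = torch.randn(100, 16), torch.randint(0, 5, (100,))
print(class_mean_alignment_perturbation(feats, labels, torch.randn(16)))
```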

Authors:Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
Title: StreamingVLM: Real-Time Understanding for Infinite Video Streams
Abstract:
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
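
The described KV-cache policy is easy to sketch: keep a handful of initial attention-sink entries, a short window of recent vision-token states, and a longer window of recent text-token states. The window sizes here are placeholders, not the paper's tuned values.

```python
from collections import deque

class StreamingKVCache:
    """Compact cache: attention-sink states plus recent vision / text windows.
    `kv` stands in for one token's key/value state."""
    def __init__(self, n_sink=4, vision_window=512, text_window=2048):
        self.sink, self.n_sink = [], n_sink
        self.vision = deque(maxlen=vision_window)   # short window, evicts oldest
        self.text = deque(maxlen=text_window)       # long window, evicts oldest

    def append(self, kv, modality):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)                    # first tokens become sinks
        elif modality == "vision":
            self.vision.append(kv)
        else:
            self.text.append(kv)

    def state(self):
        return self.sink + list(self.vision) + list(self.text)

cache = StreamingKVCache()
for i in range(100_000):                            # unbounded stream, bounded cache
    cache.append(i, "vision" if i % 2 == 0 else "text")
print(len(cache.state()))                           # 4 + 512 + 2048 = 2564
```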

Authors:Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan
Title: VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Abstract:
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement) and 93.5% on LIBERO-LONG (a 24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.

Authors:Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
Title: SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Abstract:
With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, which, to the best of our knowledge, is the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.

Authors:Arthur Bizzi, Matias Grynberg, Vitor Matias, Daniel Perazzo, João Paulo Lima, Luiz Velho, Nuno Gonçalves, João Pereira, Guilherme Schardong, Tiago Novello
Title: FLOWING: Implicit Neural Flows for Structure-Preserving Morphing
Abstract:
Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at http://schardong.github.io/flowing.
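
The flow-centric recipe can be prototyped in a few lines: learn a velocity field v(x, t) and obtain the warp by integrating it; integrating in reverse (approximately) inverts the map. This Euler-integration sketch with a toy MLP is illustrative only, not FLOWING's architecture.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy velocity field v(x, t) for 2D points; sizes are illustrative."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t.expand(len(x), 1)], dim=-1))

def warp(field, x, t0=0.0, t1=1.0, steps=32):
    """Integrate dx/dt = v(x, t) from t0 to t1 with explicit Euler steps;
    swapping t0 and t1 (negative dt) approximately inverts the warp."""
    dt = (t1 - t0) / steps
    t = torch.full((1,), t0)
    for _ in range(steps):
        x = x + dt * field(x, t)
        t = t + dt
    return x

field = VelocityField()
pts = torch.rand(128, 2)
back = warp(field, warp(field, pts), t0=1.0, t1=0.0)  # ~pts, up to Euler error
print((back - pts).abs().max())
```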

Authors:David-Alexandre Duclos, William Guimont-Martin, Gabriel Jeanson, Arthur Larochelle-Tremblay, Théo Defosse, Frédéric Moore, Philippe Nolet, François Pomerleau, Philippe Giguère
Title: SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests
Abstract:
Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.

Authors:Valentin Biller, Lucas Zimmer, Can Erdur, Sandeep Nagar, Daniel Rückert, Niklas Bubeck, Jonas Weidner
Title: A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Abstract:
Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git

Authors:Junyan Ye, Dongzhi Jiang, Jun He, Baichuan Zhou, Zilong Huang, Zhiyuan Yan, Hongsheng Li, Conghui He, Weijia Li
Title: BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception ("see") and requires fine-grained observation and analytical reasoning ("observe"). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space, such as chain-of-thought or self-criticism, can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice

Authors:Qihang Ma, Shengyu Li, Jie Tang, Dingkang Yang, Shaodong Chen, Yingyi Zhang, Chao Feng, Jiao Ran
Title: Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Abstract:
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely used strategies, i.e., zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.

Authors:Jinyuan Liu, Zihang Chen, Zhu Liu, Zhiying Jiang, Long Ma, Xin Fan, Risheng Liu
Title: Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark
Abstract:
We engage in the relatively underexplored task named thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In light of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model's features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model's handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhances the overall contrast of the thermal image. Furthermore, we introduce a high-quality, multi-scenario infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76\% improvement. Code is available at https://github.com/Zihang-Chen/HM-TIR.

Authors:Wenyao Zhang, Hongsi Liu, Bohan Li, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
Title: Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Abstract:
Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) First, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which in turn benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.

Authors:Haozhe Jia, Wenshuo Chen, Xiucheng Wang, Nan Cheng, Hongbo Zhang, Kuimou Yu, Songning Lai, Nanjian Jia, Bowen Tian, Hongru Xiao, Yutao Yue
Title: RadioFlow: Efficient Radio Map Construction Framework with Flow Matching
Abstract:
Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose \textbf{RadioFlow}, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with \textbf{up to 8$\times$ fewer parameters} and \textbf{over 4$\times$ faster inference} compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at \href{https://github.com/Hxxxz0/RadioFlow}{GitHub}.
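
For intuition, a generic conditional flow-matching objective on linear noise-to-data paths looks like the sketch below; the `model(x_t, t)` signature is an assumption, and RadioFlow's actual conditioning and architecture are not reproduced here. With straight conditional paths, a well-trained field supports very few integration steps, in the limit a single Euler step x1 ≈ x0 + v(x0, 0), which is where the single-step speedup comes from.

```python
import torch

def flow_matching_loss(model, x1):
    """Conditional flow matching on linear paths x_t = (1 - t) x0 + t x1 with
    target velocity x1 - x0. `model` is any velocity network; this is a
    generic objective, not RadioFlow's released training code."""
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)))    # broadcastable times
    xt = (1 - t) * x0 + t * x1                             # point on the path
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

dummy = lambda x, t: torch.zeros_like(x)                   # stand-in velocity net
print(flow_matching_loss(dummy, torch.randn(8, 1, 32, 32)))
```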

Authors:Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
Title: Spotlight on Token Perception for Multimodal Reinforcement Learning
Abstract:
While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
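
A schematic of the dual mechanism, assuming per-token visual-dependency scores are already computed: the trajectory advantage is reweighted by mean dependency, and a clipped policy-gradient loss is applied only to tokens above a perception threshold. The constants and the score definition are hypothetical.

```python
import torch

def vppo_loss(logp, logp_old, advantage, perception, tau=0.7, clip=0.2):
    """logp / logp_old: (T,) per-token log-probs under new / old policy.
    advantage: scalar trajectory advantage. perception: (T,) visual-dependency
    scores in [0, 1]. tau (pivotal-token threshold) and clip are placeholders."""
    adv = advantage * perception.mean()            # reweight by visual dependency
    ratio = (logp - logp_old).exp()                # importance ratios
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip, 1 + clip) * adv
    per_token = -torch.min(unclipped, clipped)     # PPO-style clipped objective
    pivotal = perception >= tau                    # update only pivotal tokens
    sel = per_token[pivotal] if pivotal.any() else per_token
    return sel.mean()
```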

Authors:Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
Title: MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Abstract:
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \texttt{[FIND]} token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg

Authors:Zirun Zhou, Zhengyang Xiao, Haochuan Xu, Jing Sun, Di Wang, Jingfeng Zhang
Title: Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects
Abstract:
Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in the presence of physical triggers. Specifically, based on a popular VLA benchmark LIBERO, we introduce BadLIBERO that incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA's actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal in 97 percentage of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at https://goba-attack.github.io/.

Authors:Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Title: Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
Abstract:
To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
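
DSE itself is a one-liner once responses are clustered by bidirectional entailment: it is the entropy of the cluster frequencies. The abstract does not state the logarithm base, so the natural log below (and hence how it maps onto the 0.3/0.6 thresholds) is an assumption.

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    """cluster_ids: the semantic-cluster label assigned to each sampled answer
    (clusters formed via bidirectional entailment checks). Returns the entropy
    of the cluster frequencies; high values flag likely hallucinations."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(discrete_semantic_entropy(["a"] * 15))                         # 0.0: all agree
print(discrete_semantic_entropy(["a"] * 5 + ["b"] * 5 + ["c"] * 5))  # ln 3 ~ 1.10
```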

Authors:Vijay M. Galshetwar, Praful Hambarde, Prashant W. Patil, Akshay Dudhane, Sachin Chaudhary, Santosh Kumar Vipparathi, Subrahmanyam Murala
Title: Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation
Abstract:
Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration

Authors:Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
Title: Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
Abstract:
We propose Stable Video Infinity (SVI), which is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.

Authors:Yue Li, Shida Sun, Yu Hong, Feihu Xu, Zhiwei Xiong
Title: 3D Reconstruction from Transient Measurements with Time-Resolved Transformer
Abstract:
Transient measurements, captured by time-resolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross-attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at https://github.com/Depth2World/TRT.

Authors:Mukilan Karuppasamy, Shankar Gangisetty, Shyam Nandan Rai, Carlo Masone, C V Jawahar
Title: Towards Safer and Understandable Driver Intention Prediction
Abstract:
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before the maneuver occurs, i.e., driver intent prediction (DIP), which plays a critical role in AD systems and driver safety. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/

Authors:Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna
Title: SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding
Abstract:
Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.
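As an illustration of the object-centric composition idea described above, the following sketch pastes an RGBA object segment onto a background image and derives the corresponding mask and box annotations. It is a minimal stand-in: the uniform placement is an assumed layout prior, SOS's structured layout priors and generative relighting are omitted, and paste_segment is a hypothetical helper rather than the paper's pipeline.

import numpy as np

def paste_segment(background: np.ndarray, segment_rgba: np.ndarray, rng=None):
    """Alpha-composite an RGBA object segment onto an RGB background and return
    the composited image, the instance mask, and the bounding box.
    Assumes the segment fits in the frame and has a non-empty alpha mask."""
    rng = rng or np.random.default_rng()
    H, W, _ = background.shape
    h, w, _ = segment_rgba.shape
    # Placeholder layout prior: uniform placement that keeps the object in frame.
    y0 = int(rng.integers(0, H - h + 1))
    x0 = int(rng.integers(0, W - w + 1))

    rgb = segment_rgba[..., :3].astype(np.float32)
    alpha = segment_rgba[..., 3:4].astype(np.float32) / 255.0

    out = background.astype(np.float32).copy()
    region = out[y0:y0 + h, x0:x0 + w]
    out[y0:y0 + h, x0:x0 + w] = alpha * rgb + (1.0 - alpha) * region

    mask = np.zeros((H, W), dtype=bool)
    mask[y0:y0 + h, x0:x0 + w] = segment_rgba[..., 3] > 127
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return out.astype(np.uint8), mask, box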

Authors:Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Title: MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Abstract:
We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.

Authors:Xiaoxiao Ma, Feng Zhao, Pengyang Ling, Haibo Qiu, Zhixiang Wei, Hu Yu, Jie Huang, Zhixiong Zeng, Lin Ma
Title: Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
Abstract:
In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85\% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
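To make the entropy-guided temperature idea concrete, here is a minimal sketch that sets a per-position sampling temperature from the normalized entropy of each token distribution. The linear schedule between t_min and t_max and the name entropy_scaled_sample are assumptions for illustration; the paper's exact mapping from spatial entropy to temperature, and its entropy-aware speculative-decoding acceptance rule, are not reproduced here.

import math
import torch
import torch.nn.functional as F

def entropy_scaled_sample(logits: torch.Tensor,
                          t_min: float = 0.8, t_max: float = 1.2) -> torch.Tensor:
    """Sample one token per position; each position's temperature is set by the
    normalized entropy of its predictive distribution.
    logits: (num_positions, vocab_size)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # (N,)
    entropy = entropy / math.log(logits.shape[-1])                   # normalize to [0, 1]
    # Assumed schedule: confident (low-entropy) positions sample more greedily.
    temperature = t_min + (t_max - t_min) * entropy                  # (N,)
    scaled = logits / temperature.unsqueeze(-1)
    return torch.multinomial(F.softmax(scaled, dim=-1), num_samples=1).squeeze(-1)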

Authors:Yiyang Huang, Yizhou Wang, Yun Fu
Title: D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Abstract:
Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.

Authors:Rohan Choudhury, Shanchuan Lin, Jianyi Wang, Hao Chen, Qi Zhao, Feng Cheng, Lu Jiang, Kris Kitani, Laszlo A. Jeni
Title: SkipSR: Faster Super Resolution with Token Skipping
Abstract:
Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
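The region-skipping idea above can be illustrated with a simple detail mask computed directly from the low-resolution input: tiles below a Laplacian-energy threshold would be upscaled cheaply (e.g., bicubically) while the remaining tiles go through the SR model. SkipSR's actual region selector is learned, so the threshold, tile size, and the name detail_mask below are illustrative assumptions.

import torch
import torch.nn.functional as F

def detail_mask(lr: torch.Tensor, tile: int = 16, thresh: float = 1e-3) -> torch.Tensor:
    """lr: (B, C, H, W) low-resolution frames in [0, 1].
    Returns a (B, H // tile, W // tile) bool mask, True where a tile has enough
    high-frequency energy to be worth super-resolving."""
    gray = lr.mean(dim=1, keepdim=True)
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=lr.device, dtype=lr.dtype).view(1, 1, 3, 3)
    energy = F.conv2d(gray, lap, padding=1).abs()
    tile_energy = F.avg_pool2d(energy, kernel_size=tile, stride=tile)
    return tile_energy.squeeze(1) > thresh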

Authors:Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G Hauptmann
Title: Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
Abstract:
Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.

Authors:Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
Title: Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Abstract:
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.

Authors:Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
Title: Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Abstract:
Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLMs can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released on \href{https://github.com/ZJUI-AI4H/Hulu-Med}{https://github.com/ZJUI-AI4H/Hulu-Med}.

Authors:Zhe Dong, Yuzhe Sun, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
Title: PhyDAE: Physics-Guided Degradation-Adaptive Experts for All-in-One Remote Sensing Image Restoration
Abstract:
Remote sensing images inevitably suffer from various degradation factors during acquisition, including atmospheric interference, sensor limitations, and imaging conditions. These complex and heterogeneous degradations pose severe challenges to image quality and downstream interpretation tasks. Addressing limitations of existing all-in-one restoration methods that overly rely on implicit feature representations and lack explicit modeling of degradation physics, this paper proposes Physics-Guided Degradation-Adaptive Experts (PhyDAE). The method employs a two-stage cascaded architecture transforming degradation information from implicit features into explicit decision signals, enabling precise identification and differentiated processing of multiple heterogeneous degradations including haze, noise, blur, and low-light conditions. The model incorporates progressive degradation mining and exploitation mechanisms, where the Residual Manifold Projector (RMP) and Frequency-Aware Degradation Decomposer (FADD) comprehensively analyze degradation characteristics from manifold geometry and frequency perspectives. Physics-aware expert modules and temperature-controlled sparse activation strategies are introduced to enhance computational efficiency while ensuring imaging physics consistency. Extensive experiments on three benchmark datasets (MD-RSID, MD-RRSHID, and MDRS-Landsat) demonstrate that PhyDAE achieves superior performance across all four restoration tasks, comprehensively outperforming state-of-the-art methods. Notably, PhyDAE substantially improves restoration quality while achieving significant reductions in parameter count and computational complexity, resulting in remarkable efficiency gains compared to mainstream approaches and achieving optimal balance between performance and efficiency. Code is available at https://github.com/HIT-SIRS/PhyDAE.

Authors:Saumya B
Title: Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation
Abstract:
Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
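For reference, the focal loss being tuned here follows the standard formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); a minimal binary version in PyTorch is sketched below. The default alpha and gamma values are the commonly used ones, not necessarily those selected by this study.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits and targets share the same shape; targets are 0/1 floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                 # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()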

Authors:Haofei Xu, Daniel Barath, Andreas Geiger, Marc Pollefeys
Title: ReSplat: Learning Recurrent Gaussian Splats
Abstract:
While feed-forward Gaussian splatting models provide computational efficiency and effectively handle sparse input settings, their performance is fundamentally limited by the reliance on a single forward pass during inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a $16 \times$ subsampled space, producing $16 \times$ fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying numbers of input views (2, 8, 16), resolutions ($256 \times 256$ to $540 \times 960$), and datasets (DL3DV and RealEstate10K) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at https://haofeixu.github.io/resplat/.
English: ReSplat introduces a recurrent Gaussian splatting model that iteratively refines 3D Gaussians using rendering error as feedback, achieving state-of-the-art performance with fewer Gaussians and faster rendering across diverse datasets and settings.
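The recurrent refinement described above can be summarized by the following conceptual loop, in which the rendering error of the current Gaussians is fed to a small update network that predicts parameter deltas, with no explicit gradient computation at test time. The render and update_net callables and the flat parameter tensor are placeholders assumed for illustration, not ReSplat's actual modules.

import torch

def recurrent_refine(gaussians: torch.Tensor, target_views: torch.Tensor,
                     render, update_net, num_steps: int = 3) -> torch.Tensor:
    """gaussians: flat tensor of Gaussian parameters; target_views: (V, C, H, W)."""
    for _ in range(num_steps):
        rendered = render(gaussians, target_views)             # re-render with current parameters
        error = target_views - rendered                        # rendering error as feedback signal
        gaussians = gaussians + update_net(gaussians, error)   # learned update, no test-time backprop
    return gaussians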

Authors:Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
Title: MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Abstract:
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
English: This paper introduces a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories and preference pairs to train a VLM controller, achieving superior performance on multiple benchmarks for robust tool-use reasoning.

Authors:Meixi Song, Xin Lin, Dizhe Zhang, Haodong Li, Xiangtai Li, Bo Du, Lu Qi
Title: D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse view conditions. The project page can be found at: https://insta360-research-team.github.io/DDGS-website/.
English: This paper introduces D²GS, a unified framework that addresses overfitting and underfitting in sparse-view 3D Gaussian Splatting through adaptive Gaussian masking and distance-aware enhancement, significantly improving visual quality and robustness.
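A minimal sketch of the depth-and-density guided dropout idea is given below: Gaussians that are both close to the camera and in locally dense regions are dropped with higher probability during training. The normalizations, the linear probability schedule, and the max_drop parameter are illustrative assumptions rather than the paper's exact formulation.

import torch

def dd_dropout_keep_mask(depths: torch.Tensor, density: torch.Tensor,
                         max_drop: float = 0.5) -> torch.Tensor:
    """depths, density: (N,) per-Gaussian camera depth and local density estimate.
    Returns a bool mask with True for Gaussians to keep."""
    near = 1.0 - (depths - depths.min()) / (depths.max() - depths.min() + 1e-8)
    dense = (density - density.min()) / (density.max() - density.min() + 1e-8)
    p_drop = max_drop * near * dense          # highest for near-camera, dense Gaussians
    return torch.rand_like(p_drop) >= p_drop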

Authors:Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai
Title: NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Abstract:
Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.
English: This paper introduces NaViL, a native Multimodal Large Language Model trained end-to-end, which optimizes architecture and scaling under data constraints to achieve competitive performance across 14 benchmarks while revealing the positive correlation between visual encoders and LLMs.

Authors:Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
Title: How to Teach Large Multimodal Models New Skills
Abstract:
How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
English Summary: This research introduces two effective fine-tuning methods that enable large multimodal models to acquire new skills while minimizing the loss of existing capabilities, by selectively updating specific network components.
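The two tuning recipes can be expressed as simple parameter filters; the sketch below assumes a LLaMA-style module naming scheme (q_proj/k_proj/v_proj/o_proj for self-attention, gate_proj/up_proj/down_proj for the MLP), which should be adapted to the actual model family.

def select_trainable(model, recipe: str = "self_attn"):
    """Freeze everything except the parameters named by the chosen recipe."""
    if recipe == "self_attn":            # (i) update only self-attention projections
        keep = ("q_proj", "k_proj", "v_proj", "o_proj")
    elif recipe == "mlp_gate_up":        # (ii) update MLP Gate&Up, freeze Down
        keep = ("gate_proj", "up_proj")
    else:
        raise ValueError(f"unknown recipe: {recipe}")
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keep)
    return model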

Authors:Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
Title: MultiCOIN: Multi-Modal COntrollable Video INbetweening
Abstract:
Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the user's creative intent. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

Authors:Xueyi Liu, He Wang, Li Yi
Title: DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model
Abstract:
Achieving generalized in-hand object rotation remains a significant challenge in robotics, largely due to the difficulty of transferring policies from simulation to the real world. The complex, contact-rich dynamics of dexterous manipulation create a "reality gap" that has limited prior work to constrained scenarios involving simple geometries, limited object sizes and aspect ratios, constrained wrist poses, or customized hands. We address this sim-to-real challenge with a novel framework that enables a single policy, trained in simulation, to generalize to a wide variety of objects and conditions in the real world. The core of our method is a joint-wise dynamics model that learns to bridge the reality gap by effectively fitting a limited amount of real-world data and then adapting the sim policy's actions accordingly. The model is highly data-efficient and generalizable across different whole-hand interaction distributions by factorizing dynamics across joints, compressing system-wide influences into low-dimensional variables, and learning each joint's evolution from its own dynamic profile, implicitly capturing these net effects. We pair this with a fully autonomous data collection strategy that gathers diverse, real-world interaction data with minimal human intervention. Our complete pipeline demonstrates unprecedented generality: a single policy successfully rotates challenging objects with complex shapes (e.g., animals), high aspect ratios (up to 5.33), and small sizes, all while handling diverse wrist orientations and rotation axes. Comprehensive real-world evaluations and a teleoperation application for complex tasks validate the effectiveness and robustness of our approach. Website: https://meowuu7.github.io/DexNDM/
English Summary: A novel sim-to-real framework using a joint-wise dynamics model and autonomous data collection enables a single policy to achieve generalized in-hand rotation of diverse real-world objects with complex shapes and sizes.

Authors:Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue
Title: VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
Abstract:
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
English Summary: This paper introduces VideoCanvas, a novel framework that enables arbitrary spatio-temporal video completion by resolving temporal ambiguity in latent video models through a hybrid conditioning strategy, unifying various video generation tasks under a single paradigm.
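The key trick of Temporal RoPE Interpolation is that a condition frame can be assigned a continuous, fractional temporal position inside the latent sequence; rotary embeddings evaluate cleanly at non-integer positions, as the sketch below shows. The dimension and base frequency follow common RoPE conventions, and the 4x temporal compression in the example is an assumed VAE setting rather than the paper's exact configuration.

import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0):
    """positions: (T,) possibly fractional positions. Returns cos/sin of shape (T, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions.to(torch.float32)[:, None] * inv_freq[None, :]
    return angles.cos(), angles.sin()

# A condition aligned to pixel frame 10 under an assumed 4x-compressing causal VAE
# would receive latent position 10 / 4 = 2.5, between latent frames 2 and 3.
cos, sin = rope_angles(torch.tensor([0.0, 1.0, 2.5]), dim=64)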

Authors:Yunzhe Xu, Yiyuan Pan, Zhe Liu
Title: Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.
English: The proposed Memoir system enhances memory-persistent Vision-and-Language Navigation by using a world model to imaginatively retrieve relevant environmental observations and behavioral patterns, achieving significant performance gains, faster training, and reduced inference memory across multiple benchmarks.

Authors:Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, Jiangmiao Pang
Title: ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
Abstract:
On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.
English: ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM pipelines, achieving interactive performance, robustness, and high-quality 3D reconstruction for real-time digitization of environments.

Authors:Rishubh Parihar, Or Patashnik, Daniil Ostashev, R. Venkatesh Babu, Daniel Cohen-Or, Kuan-Chieh Wang
Title: Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Abstract:
Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
English Summary: Kontinuous Kontext is an instruction-driven image editing model that enables fine-grained control over edit strength through a scalar input, allowing smooth transitions from subtle to full modifications across various editing operations without requiring specialized training.

Authors:Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
Abstract:
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
English Summary: The study introduces SpatialLadder, a 3B-parameter model trained through a progressive three-stage framework on the SpatialLadder-26k dataset, achieving state-of-the-art spatial reasoning performance with significant improvements over base models and competitors like GPT-4o and Gemini-2.0-Flash while demonstrating strong generalization.

Authors:Zhitong Huang, Mohan Zhang, Renhan Wang, Rui Tang, Hao Zhu, Jing Liao
Title: X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering
Abstract:
We present X2Video, the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multi-modal controls with reference images and text prompts for both global and local regions. The intrinsic guidance allows accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments in the absence of intrinsic information. To enable these functionalities, we extend the intrinsic-guided image generation model XRGB to video generation by employing a novel and efficient Hybrid Self-Attention, which ensures temporal consistency across video frames and also enhances fidelity to reference images. We further develop a Masked Cross-Attention to disentangle global and local text prompts, applying them effectively onto respective local and global regions. For generating long videos, our novel Recursive Sampling method incorporates progressive frame sampling, combining keyframe prediction and frame interpolation to maintain long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we assembled a video dataset named InteriorVideo, featuring 1,154 rooms from 295 interior scenes, complete with reliable ground-truth intrinsic channel sequences and smooth camera trajectories. Both qualitative and quantitative evaluations demonstrate that X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions. Additionally, X2Video effectively accommodates multi-modal controls with reference images, global and local text prompts, and simultaneously supports editing on color, material, geometry, and lighting through parametric tuning. Project page: https://luckyhzt.github.io/x2video

Authors:Zhiyuan Zhang, Can Wang, Dongdong Chen, Jing Liao
Title: FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Abstract:
We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.

Authors:Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal
Title: To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Abstract:
Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

Authors:Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, Tsung-Wei Ke
Title: DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
Abstract:
We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos, without the need for manual data collection and costly motion capture, and enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.

Authors:Andrew Lee, Ian Chuang, Dechen Gao, Kai Fukazawa, Iman Soltani
Title: Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
Abstract:
Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
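The return-guided triplet construction can be sketched as follows: within a batch of visually similar states (the similarity grouping itself is omitted here), states with close returns act as positives for an anchor and states with large return gaps act as negatives. The thresholds and the name return_guided_triplet_loss are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def return_guided_triplet_loss(features: torch.Tensor, returns: torch.Tensor,
                               margin: float = 0.5, gap: float = 1.0) -> torch.Tensor:
    """features: (N, D) attention-pooled state representations; returns: (N,) episode returns."""
    idx = torch.arange(len(returns), device=returns.device)
    losses = []
    for i in range(len(returns)):
        diff = (returns - returns[i]).abs()
        pos = ((diff < 0.1 * gap) & (idx != i)).nonzero().flatten()   # similar outcome
        neg = (diff > gap).nonzero().flatten()                        # different outcome
        if len(pos) == 0 or len(neg) == 0:
            continue
        losses.append(F.triplet_margin_loss(features[i:i + 1], features[pos[:1]],
                                            features[neg[:1]], margin=margin))
    return torch.stack(losses).mean() if losses else features.new_zeros(())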

Authors:Yihong Luo, Tianyang Hu, Jing Tang
Title: Reinforcing Diffusion Models by Direct Group Preference Optimization
Abstract:
While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
English Summary: DGPO is a novel online reinforcement learning algorithm that enables efficient training of diffusion models by learning directly from group-level preferences, eliminating the need for stochastic policies and allowing the use of fast deterministic samplers, resulting in 20x faster training and superior performance.

Authors:Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
Title: UniVideo: Unified Understanding, Generation, and Editing for Videos
Abstract:
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
English: UniVideo is a unified multimodal framework that extends unified modeling to the video domain, enabling diverse video generation and editing tasks through a dual-stream design combining a Multimodal Large Language Model for instruction understanding and a Multimodal DiT for video generation, achieving state-of-the-art performance and supporting task composition and generalization to unseen instructions.

Authors:Haipeng Liu, Yang Wang, Meng Wang
Title: One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Abstract:
Text-guided image inpainting aims to reconstruct masked regions according to text prompts. Its long-standing challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and inpainted masked regions; prior methods fail to address both at once, typically remedying only one of them. We observe that this stems from the entanglement of hybrid (e.g., mid- and low-) frequency bands, which encode different image properties and exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting, which decomposes semantic consistency across masked and unmasked regions into per-frequency-band consistencies while preserving the unmasked regions, thereby addressing both challenges at once. We further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid- and low-frequency bands are disentangled. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process and meanwhile serves as guidance for a null-text denoising process that denoises the low-frequency band of the masked regions, followed by a subsequent text-guided denoising process at the late stage. This achieves semantic consistency for the mid- and low-frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
English: This paper introduces NTN-Diff, a frequency-aware diffusion model that addresses text-guided image inpainting by disentangling mid-and-low frequency bands during denoising to achieve semantic consistency between masked and unmasked regions while preserving original content.

Authors:Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang
Title: Real-Time Motion-Controllable Autoregressive Video Diffusion
Abstract:
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.

Authors:Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, Mingkui Tan
Title: Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection
Abstract:
AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
English: This paper introduces a physics-based detection method for AI-generated videos using Normalized Spatiotemporal Gradient (NSG) to identify physical inconsistencies, achieving significant performance improvements over existing techniques.
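The detection statistic is a Maximum Mean Discrepancy between NSG features of the test video and of reference real videos; a minimal biased RBF-kernel MMD^2 estimate is sketched below. The NSG estimator itself is not reproduced, and the bandwidth sigma is an assumed hyperparameter.

import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: (n, d) and y: (m, d) feature sets. Returns the biased MMD^2 estimate."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()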

Authors:Junyu Shi, Minghui Li, Junguo Zuo, Zhifei Yu, Yipeng Lin, Shengshan Hu, Ziqi Zhou, Yechao Zhang, Wei Wan, Yinzhe Xu, Leo Yu Zhang
Title: Towards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery Faces
Abstract:
Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short of effective real-world application due to their lack of specificity, limited deepfake diversity, and restricted manipulation techniques. To address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world needs. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found "in the wild", effectively simulating real-world black-box scenarios. Moreover, RedFace's deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes in real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reasons for its impact on detection performance compared to conventional datasets. Our dataset is available at: https://github.com/kikyou-220/RedFace.
中文摘要:利用AIGC技术生成的深度伪造内容严重威胁社交媒体真实性,而RedFace数据集通过商业平台生成的6万余张图像和1000段视频弥补现有检测方法的不足,能更有效地模拟现实场景。
English Summary: Deepfakes created with AIGC pose serious threats to social media authenticity, and the RedFace dataset addresses current detection limitations by providing 60,000+ images and 1,000 videos generated through commercial platforms to better simulate real-world scenarios.

Authors:Bheeshm Sharma, Karthikeyan Jaganathan, Balamurugan Palaniappan
Title: RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans
Abstract:
Weakly Supervised Anomaly Detection (WSAD) in brain MRI scans is an important challenge: it enables quick and accurate detection of brain anomalies when precise pixel-level annotations are unavailable and only weak labels (e.g., slice-level) are available. In this work, we propose RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings, a novel two-stage WSAD framework. In the first stage, we introduce a Discriminative Dual Prompt Tuning (DDPT) mechanism that generates high-quality pseudo weak masks based on slice-level labels, serving as coarse localization cues. In the second stage, we propose a segmentation network with a region-aware spatial attention mechanism that relies on fixed location-based random embeddings. This design enables the model to effectively focus on anomalous regions. Our approach achieves state-of-the-art anomaly detection performance, significantly outperforming existing WSAD methods while utilizing less than 8 million parameters. Extensive evaluations on the BraTS20, BraTS21, BraTS23, and MSD datasets demonstrate a substantial performance improvement coupled with a significant reduction in computational complexity. Code is available at: https://github.com/BheeshmSharma/RASALoRE-BMVC-2025/.
中文:RASALoRE是一种新颖的两阶段弱监督脑部MRI异常检测框架,通过判别式双提示调优生成伪掩膜后,采用基于位置的随机嵌入与区域感知空间注意力机制,以极少参数量实现了最先进的检测性能。
English: RASALoRE is a novel two-stage weakly supervised anomaly detection framework for brain MRI that first generates pseudo masks using discriminative dual prompt tuning and then employs region-aware spatial attention with location-based random embeddings to achieve state-of-the-art performance with minimal parameters.

Authors:Shaohong Wang, Bin Lu, Xinyu Xiao, Hanzhi Zhong, Bowen Pang, Tong Wang, Zhiyu Xiang, Hangguan Shan, Eryun Liu
Title: RayFusion: Ray Fusion Enhanced Collaborative Visual Perception
Abstract:
Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.
Chinese: RayFusion是一种基于射线的融合方法,通过利用协作车辆的射线占据信息来减少深度估计的不确定性,从而提升纯摄像头协作感知系统的三维物体检测性能。
English: RayFusion is a ray-based fusion method that enhances camera-only collaborative perception in autonomous driving by using collaborators' ray occupancy to reduce depth ambiguity and improve 3D object detection accuracy.

Authors:Gaurvi Goyal, Pham Cong Thuong, Arren Glover, Masayoshi Mizuno, Chiara Bartolozzi
Title: GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network
Abstract:
Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise of deep learning, robust methods are available to consumers using RGB cameras and commercial GPUs. On the other hand, event-based cameras have gained popularity in the vision research community for their low latency and low energy consumption, which make them ideal for resource-constrained applications such as portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line-based event representation, to estimate the 2D human pose of a single person at high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence-based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at https://github.com/event-driven-robotics/GraphEnet-NeVi-ICCV2025.
中文摘要:本文提出GraphEnet,这是首个将图神经网络应用于事件相机数据进行人体姿态估计的方法,采用创新的偏移向量学习和置信度池化技术,实现高频二维姿态追踪。
English Summary: This paper introduces GraphEnet, the first Graph Neural Network for human pose estimation using event camera data, featuring a novel offset vector learning method with confidence-based pooling to achieve high-frequency 2D pose tracking.

Authors:Yufei Tong, Guanjie Cheng, Peihan Wu, Yicheng Zhu, Kexu Lu, Feiyi Chen, Meng Xi, Junqin Huang, Shuiguang Deng
Title: SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
Abstract:
With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: https://github.com/dllgyufei/SatFusion.git.
中文: SatFusion是一个统一框架,通过多时相和多源数据的深度融合与自适应组合,有效提升卫星物联网图像的质量和鲁棒性,在多种数据集上表现卓越。
English: SatFusion is a unified framework that enhances satellite IoT images by integrating multi-temporal and multi-source data through deep feature alignment and adaptive fusion, significantly improving quality and robustness across diverse datasets.

Authors:Haochen Yu, Qiankun Liu, Hongyuan Liu, Jianfei Jiang, Juntao Lyu, Jiansheng Chen, Huimin Ma
Title: XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method
Abstract:
Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the $360^\circ$ panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose XYZCylinder, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: https://yuyuyu223.github.io/XYZCYlinder-projectpage/.
中文: 提出的XYZCylinder模型通过可调节相机参数的统一柱面建模和混合表征方法,解决了前馈重建在驾驶场景中的泛化与精度问题,实现了最优性能并具备零样本跨场景适应能力。
English: The proposed XYZCylinder model addresses limitations in feedforward reconstruction by introducing a unified cylinder lifting method with adjustable camera parameters and hybrid representations, achieving state-of-the-art performance and zero-shot generalization across driving scenes.

Authors:Yijie Gao, Houqiang Zhong, Tianchi Zhu, Zhengxue Cheng, Qiang Hu, Li Song
Title: AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views
Abstract:
The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that, for robust sparse-view reconstruction, semantic understanding should instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is available at https://github.com/MediaX-SJTU/AlignGS.
中文:AlignGS提出了一种端到端框架,通过主动利用语义理解来指导稀疏视图下的三维几何重建,并以语义先验作为几何正则化器实现了最先进的性能。
English: AlignGS introduces an end-to-end framework that actively integrates semantic understanding to guide 3D geometry reconstruction from sparse views, achieving state-of-the-art results by using semantic priors as geometric regularizers.

Authors:Harsh Kavediya, Vighnesh Nayak, Bheeshm Sharma, Balamurugan Palaniappan
Title: IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
Abstract:
Translating sign language into spoken-language audio is important for connecting hearing- and speech-impaired individuals with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on the ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01% and 78.67%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems-2025.
中文: 本研究提出IsoSignVid2Aud端到端框架,能够将孤立手语视频序列直接转换为清晰语音而无需文本中间转换,在基准数据集上取得优异识别准确率的同时生成可理解语音输出。
English: This study introduces IsoSignVid2Aud, an end-to-end framework that directly translates isolated sign language video sequences into intelligible speech without intermediate text conversion, achieving competitive accuracy on benchmark datasets while producing clear audio output.
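
The temporal detection step mentioned above can be made concrete with a simple 1D non-maximal suppression over candidate sign segments. This is only a generic sketch under assumed inputs (scored [start, end] intervals and a temporal-IoU threshold), not the paper's NMS algorithm.

```python
import numpy as np

def temporal_nms(segments, scores, iou_thresh=0.5):
    """segments: (n, 2) array of [start, end] frame indices; scores: (n,)."""
    order = np.argsort(scores)[::-1]      # process highest-scoring segments first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Temporal IoU between the kept segment and the remaining candidates.
        inter = np.maximum(0.0,
                           np.minimum(segments[i, 1], segments[rest, 1]) -
                           np.maximum(segments[i, 0], segments[rest, 0]))
        union = (segments[i, 1] - segments[i, 0]) + \
                (segments[rest, 1] - segments[rest, 0]) - inter
        order = rest[inter / np.maximum(union, 1e-8) <= iou_thresh]
    return keep

segs = np.array([[0, 30], [5, 35], [40, 70]], dtype=float)
scrs = np.array([0.9, 0.6, 0.8])
print(temporal_nms(segs, scrs))  # -> [0, 2]: the overlapping low-score segment is dropped
```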

Authors:Kanglin Ning, Ruzhao Chen, Penghong Wang, Xingtao Wang, Ruiqin Xiong, Xiaopeng Fan
Title: An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images
Abstract:
Predicting spherical pixel depth from monocular $360^{\circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, which leads to oversmoothed room corners and sensitivity to noise. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates this information into the depth estimation process through a background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room-geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed methods significantly outperform current open-source methods. Our code is available at https://github.com/emiyaning/RGCNet.
中文: 本文提出了一种基于房间几何约束的深度估计框架,通过布局预测和背景分割机制整合几何信息,显著提升了全景室内深度估计的性能。
English: This paper introduces a depth estimation framework for 360° indoor panoramas that integrates room geometry constraints through layout prediction and background segmentation to enhance depth accuracy and structural preservation.
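
A minimal sketch of the background-segmentation-guided fusion idea, assuming the fusion weight is simply the predicted background probability; the paper derives its weights from the segmentation decoder, so the actual formulation will differ.

```python
import numpy as np

def fuse_depth(coarse_depth, background_depth, background_prob):
    """Blend the geometry-derived background depth with the coarse depth map,
    weighted per pixel by the predicted probability of belonging to the background."""
    w = np.clip(background_prob, 0.0, 1.0)
    return w * background_depth + (1.0 - w) * coarse_depth

# Toy usage on a 4x4 panorama crop: the top half is predicted as background,
# so it takes the layout-derived depth; the bottom half keeps the coarse depth.
coarse = np.full((4, 4), 2.0)
bg = np.full((4, 4), 3.0)
prob = np.zeros((4, 4)); prob[:2] = 1.0
print(fuse_depth(coarse, bg, prob))
```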

Authors:Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng
Title: GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Abstract:
Recently, the spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (e.g., a map), and thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
中文: 该摘要介绍了GTR-Bench这一新基准,用于评估视觉语言模型在地图和视频中推理移动目标的能力,揭示了当前模型与人类表现之间的显著差距,并指出了其在时空推理方面的三个主要缺陷。
English: This abstract introduces GTR-Bench, a new benchmark for evaluating visual-language models' ability to reason about moving objects across maps and videos, revealing significant performance gaps compared to humans and identifying key deficiencies in current models' spatial-temporal reasoning capabilities.

Authors:Ming Jie Ong, Sze Yinn Ung, Sim Kuan Goh, Jimmy Y. Zhong
Title: Demystifying Deep Learning-based Brain Tumor Segmentation with 3D UNets and Explainable AI (XAI): A Comparative Analysis
Abstract:
The current study investigated the use of Explainable Artificial Intelligence (XAI) to improve the accuracy of brain tumor segmentation in MRI images, with the goal of assisting physicians in clinical decision-making. The study focused on applying UNet models for brain tumor segmentation and using the XAI techniques of Gradient-weighted Class Activation Mapping (Grad-CAM) and attention-based visualization to enhance the understanding of these models. Three deep learning models - UNet, Residual UNet (ResUNet), and Attention UNet (AttUNet) - were evaluated to identify the best-performing model. XAI was employed with the aims of clarifying model decisions and increasing physicians' trust in these models. We compared the performance of two UNet variants (ResUNet and AttUNet) with the conventional UNet in segmenting brain tumors from the BraTS2020 public dataset and analyzed model predictions with Grad-CAM and attention-based visualization. Using the latest computer hardware, we trained and validated each model using the Adam optimizer and assessed their performance with respect to: (i) training, validation, and inference times, (ii) segmentation similarity coefficients and loss functions, and (iii) classification performance. Notably, during the final testing phase, ResUNet outperformed the other models with respect to Dice and Jaccard similarity scores, as well as accuracy, recall, and F1 scores. Grad-CAM provided visuospatial insights into the tumor subregions each UNet model focused on while attention-based visualization provided valuable insights into the working mechanisms of AttUNet's attention modules. These results demonstrated ResUNet as the best-performing model and we conclude by recommending its use for automated brain tumor segmentation in future clinical assessments. Our source code and checkpoint are available at https://github.com/ethanong98/MultiModel-XAI-Brats2020
中文: 本研究证明,结合可解释人工智能技术(如Grad-CAM和注意力可视化)的ResUNet模型在脑瘤MRI图像分割中表现最优,可有效提升临床诊疗辅助能力。
English: This study demonstrates that ResUNet, enhanced with explainable AI techniques like Grad-CAM and attention visualization, achieves superior performance in brain tumor segmentation from MRI images, thereby improving clinical decision-making support for physicians.
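
For reference, the Dice and Jaccard similarity scores used to compare the UNet variants can be computed on binary masks as in the following sketch; these are the standard definitions, not code from the study.

```python
import numpy as np

def dice_jaccard(pred, target, eps=1e-7):
    """Compute Dice and Jaccard (IoU) for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    jaccard = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return dice, jaccard

pred = np.zeros((8, 8), int); pred[2:6, 2:6] = 1   # predicted tumor region
gt = np.zeros((8, 8), int); gt[3:7, 3:7] = 1       # ground-truth tumor region
print(dice_jaccard(pred, gt))  # 9 overlapping pixels out of 16 + 16
```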

Authors:Yuang Meng, Xin Jin, Lina Lei, Chun-Le Guo, Chongyi Li
Title: UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes
Abstract:
Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at https://srameo.github.io/projects/ultraled.
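
A toy sketch of the ratio-map exposure-correction idea, assuming a per-pixel gain applied in linear RAW space. In UltraLED the ratio map is predicted by the network; the inverse-brightness gain below is only a stand-in for illustration.

```python
import numpy as np

def apply_ratio_map(raw, ratio_map, white_level=1.0):
    """Per-pixel gain in linear RAW space, clipped to the sensor's white level."""
    return np.clip(raw * ratio_map, 0.0, white_level)

# Hypothetical ratio map: brighten dark pixels more than bright ones.
raw = np.random.uniform(0.0, 0.2, size=(4, 4))   # short-exposure RAW crop
ratio = 1.0 / np.maximum(raw, 0.05)              # naive inverse-brightness gain
print(apply_ratio_map(raw, ratio))
```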

Authors:Jian Gao, Mengqi Yuan, Yifei Zeng, Chang Zeng, Zhihao Li, Zhenyu Chen, Weichao Qiu, Xiao-Xiao Long, Hao Zhu, Xun Cao, Yao Yao
Title: ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
Abstract:
Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.
中文摘要:ComGS框架通过引入表面八面体探针实现高效可重光照物体重建,并采用简化的光照估计方法,能以28帧率实时合成具有协调阴影的高质量三维场景。
English Summary: The ComGS framework introduces Surface Octahedral Probes for efficient relightable object reconstruction and a simplified lighting estimation method, achieving real-time 3D composition with harmonious shadows at 28 FPS.

Authors:Rafin Hassan, Zarin Tasnim Roshni, Rafiqul Bari, Alimul Islam, Nabeel Mohammed, Moshiur Farazi, Shafin Rahman
Title: Label Semantics for Robust Hyperspectral Image Classification
Abstract:
Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most HSI classification models are monomodal, relying solely on spectral-spatial data to learn decision boundaries in the high-dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class-specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that capture its unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics, which in turn leads to better feature-label alignment for improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and report a significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Code is available at: https://github.com/milab-nsu/S3FN
Chinese: 针对高光谱图像分类中训练样本有限和模型易过拟合的问题,我们提出语义光谱-空间融合网络(S3FN),通过大语言模型生成类别文本描述并与光谱空间特征融合,在多个基准数据集上实现了分类性能的显著提升。
English: To overcome overfitting and limited training data in hyperspectral imaging classification, we propose the Semantic Spectral-Spatial Fusion Network (S3FN), which integrates class-specific textual descriptions generated by large language models with spectral-spatial data to significantly enhance classification performance across multiple benchmark datasets.
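
A minimal sketch of the feature-label alignment idea, assuming the class-description embeddings (e.g., from BERT) are precomputed and frozen. The projection head, cosine-similarity logits, and temperature below are illustrative choices rather than the S3FN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSemanticHead(nn.Module):
    """Project spectral-spatial features into the text-embedding space and
    classify by cosine similarity against fixed class-description embeddings."""
    def __init__(self, feat_dim, text_dim, class_text_emb):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)
        # (num_classes, text_dim) embeddings from a frozen text encoder.
        self.register_buffer("class_emb", F.normalize(class_text_emb, dim=-1))

    def forward(self, feats, temperature=0.07):
        z = F.normalize(self.proj(feats), dim=-1)
        return z @ self.class_emb.t() / temperature   # cosine-similarity logits

# Toy usage with random stand-ins for encoder outputs and text embeddings.
head = LabelSemanticHead(feat_dim=256, text_dim=768,
                         class_text_emb=torch.randn(5, 768))
logits = head(torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 5])
```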

Authors:Pragati Shuddhodhan Meshram, Varun Chandrasekaran
Title: D2RA: Dual Domain Regeneration Attack
Abstract:
The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at https://github.com/Pragati-Meshram/DAWN.
Chinese: 本文提出了一种无需训练的D2RA攻击方法,通过利用多种表征的自然先验,有效去除或削弱图像水印,揭示了当前语义水印方案的根本缺陷。
English: The paper introduces D2RA, a training-free attack method that effectively removes or weakens watermarks from images by leveraging natural priors across multiple representations, exposing vulnerabilities in current semantic watermarking schemes.

Authors:Guoliang Gong, Man Yu
Title: A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy
Abstract:
Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.
中文: 本文提出图像净化策略和频域流匹配模型,解决超低剂量CT去噪中的噪声和空间错位问题,在真实临床数据上实现了最优的解剖结构保护效果。
English: This paper introduces an Image Purification strategy and a Frequency-domain Flow Matching model to address severe noise and spatial misalignment in ultra-low dose CT denoising, achieving state-of-the-art structure preservation on real clinical data.

Authors:Nithin C. Babu, Aniruddha Mahapatra, Harsh Rangwani, Rajiv Soundararajan, Kuldeep Kulkarni
Title: DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis
Abstract:
Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) They emphasize subject-centric prompts or static camera scenes, so camera motion, which is essential for producing cinematic shots, and the behavior of existing metrics under dynamic motion remain largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlooks video-level evaluation, which is vital for selecting the best video among the candidates generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. We observe that while the VBench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct these two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2 percentage points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.

Authors:Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
Title: MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Abstract:
Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.

Authors:Inzamamul Alam, Md Tanvir Islam, Khan Muhammad, Simon S. Woo
Title: SpecGuard: Spectral Projection-based Advanced Invisible Watermarking
Abstract:
Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against common transformations, primarily distortions, image regeneration, and adversarial perturbations, creating real-world challenges. In this work, we introduce SpecGuard, a novel approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs a Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval's theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate that the proposed SpecGuard outperforms state-of-the-art models. To ensure reproducibility, the full code is released at https://github.com/inzamamulDU/SpecGuard_ICCV_2025.
中文: SpecGuard 提出了一种鲁棒且不可见的图像水印技术,通过将信息嵌入隐藏卷积层并利用谱投影转换至频域,有效抵御多种攻击,同时确保水印的准确提取。
English: SpecGuard introduces a robust and imperceptible image watermarking technique by embedding messages in hidden convolution layers through spectral projection to the frequency domain, enhancing resilience against various attacks while ensuring accurate retrieval.
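
To make the frequency-domain idea concrete, the toy sketch below embeds bits into high-frequency FFT coefficients of an image and reads them back against the original. Unlike SpecGuard, it is non-blind and uses no convolution layers or wavelet decomposition, so it only illustrates the general principle of high-band spectral embedding.

```python
import numpy as np

def embed_bits_fft(img, bits, strength=5.0, band=(0.35, 0.45)):
    """Toy frequency-domain embedding: add +/- strength to the real part of a
    ring of high-frequency coefficients (one coefficient per bit)."""
    f = np.fft.fftshift(np.fft.fft2(img.astype(float)))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    idx = np.argwhere((r > band[0]) & (r < band[1]))[: len(bits)]
    for (y, x), b in zip(idx, bits):
        f[y, x] += strength * (1 if b else -1)
    out = np.real(np.fft.ifft2(np.fft.ifftshift(f)))
    return out, idx

def extract_bits_fft(img, ref, idx):
    # Non-blind extraction: compare spectra of the watermarked and reference images.
    diff = np.fft.fftshift(np.fft.fft2(img.astype(float))) - \
           np.fft.fftshift(np.fft.fft2(ref.astype(float)))
    return [int(np.real(diff[y, x]) > 0) for y, x in idx]

img = np.random.rand(64, 64) * 255
bits = [1, 0, 1, 1, 0, 0, 1, 0]
wm, idx = embed_bits_fft(img, bits)
print(extract_bits_fft(wm, img, idx) == bits)  # True
```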

Authors:Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang
Title: GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation
Abstract:
Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods operate on fixed prompts, scaling only the noise or the number of samples, which limits their interpretability and adaptability. To address these issues, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. In addition, we summarize common error patterns and refinement strategies, providing practical insights and encouraging further exploration. Experiments on DPG-bench and Geneval, with improvements of up to 16.9% and 5.7%, demonstrate the strong capability of our method in enhancing text-image consistency and the structural coherence of generated images, revealing the effectiveness of our test-time prompt optimization strategy. The code is available at https://github.com/27yw/GenPilot.
Chinese: 本文提出GenPilot,一种灵活高效的测试时提示优化策略,通过模型无关、可解释的多智能体系统解决文本到图像合成中的语义不一致和细节缺失问题,显著提升了生成图像的一致性和结构连贯性。
English: This paper introduces GenPilot, a flexible and efficient test-time prompt optimization strategy that enhances text-to-image synthesis by addressing semantic inconsistencies and missing details through a model-agnostic, interpretable multi-agent system, achieving significant improvements in consistency and coherence.

Authors:Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu
Title: Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Abstract:
Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks are originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
中文: 现有基准不适合评估多模态大语言模型的视觉标记压缩,但简单的图像下采样方法却优于复杂技术,因此我们开发了VTC-Bench以实现更公平的评估。
English: Current benchmarks are inadequate for evaluating visual token compression in MLLMs, but simple image downsampling surprisingly outperforms complex methods, leading to the creation of VTC-Bench for fairer assessment.
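
The downsampling baseline discussed above can be approximated by resizing an image so that a ViT-style patch encoder stays within a visual-token budget. The patch size and budget below are assumptions for illustration, not the benchmark's protocol.

```python
from PIL import Image

def downsample_for_token_budget(image, patch_size=14, max_tokens=256):
    """Resize an image so a patch-based visual encoder emits at most
    `max_tokens` tokens, roughly preserving the aspect ratio."""
    w, h = image.size
    tokens = (w // patch_size) * (h // patch_size)
    if tokens <= max_tokens:
        return image
    scale = (max_tokens / tokens) ** 0.5
    new_w = max(patch_size, int(w * scale) // patch_size * patch_size)
    new_h = max(patch_size, int(h * scale) // patch_size * patch_size)
    return image.resize((new_w, new_h), Image.BICUBIC)

img = Image.new("RGB", (896, 672))               # 64 x 48 = 3072 patches originally
small = downsample_for_token_budget(img)
print(small.size, (small.size[0] // 14) * (small.size[1] // 14))  # within 256 tokens
```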

Authors:Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq
Title: Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models
Abstract:
Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs
中文: 遥感视觉语言模型在少样本学习场景下表现出不同的适应能力,部分模型天生更适合此类任务,这凸显了开发专门方法及标准化评测框架的必要性,以推动该领域研究发展。
English: Remote sensing vision-language models demonstrate varying adaptability in few-shot learning scenarios, with some being more inherently suited than others, highlighting the need for specialized methods and a standardized benchmarking framework to advance research in this area.
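
As one example of a few-shot adaptation strategy on frozen vision-language features (not necessarily among the five evaluated in the benchmark), a nearest-class-mean classifier can be sketched as follows.

```python
import numpy as np

def prototype_classifier(support_feats, support_labels, query_feats):
    """Nearest-class-mean ("prototype") adaptation on frozen, L2-normalized
    image features: one prototype per class from the few support shots."""
    def l2n(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    support_feats, query_feats = l2n(support_feats), l2n(query_feats)
    classes = np.unique(support_labels)
    protos = l2n(np.stack([support_feats[support_labels == c].mean(0)
                           for c in classes]))
    sims = query_feats @ protos.T            # cosine similarity to each prototype
    return classes[np.argmax(sims, axis=1)]

# Toy 4-way 2-shot episode with 512-d features.
rng = np.random.default_rng(0)
sup_x = rng.normal(size=(8, 512)); sup_y = np.repeat(np.arange(4), 2)
qry_x = rng.normal(size=(16, 512))
print(prototype_classifier(sup_x, sup_y, qry_x).shape)  # (16,)
```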

Authors:Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang
Title: TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Abstract:
Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
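
A toy sketch of encoding a target's relative position as a compact polar-coordinate token, in the spirit of Polar-CoT; the bin counts, range limit, and token layout are illustrative assumptions, not the paper's tokenizer.

```python
import math

def polar_token(dx, dy, num_range_bins=8, num_angle_bins=16, max_range=10.0):
    """Discretize the target's relative position (dx, dy) in the agent frame
    into a single token id: range bin * num_angle_bins + angle bin."""
    r = min(math.hypot(dx, dy), max_range - 1e-6)
    theta = math.atan2(dy, dx) % (2 * math.pi)
    r_bin = int(r / max_range * num_range_bins)
    a_bin = int(theta / (2 * math.pi) * num_angle_bins)
    return r_bin * num_angle_bins + a_bin

print(polar_token(2.0, 0.0))   # target 2 m straight ahead
print(polar_token(-1.0, 3.0))  # target ~3.2 m away at ~108 degrees
```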

Authors:Jan Fiszer, Dominika Ciupek, Maciej Malawski
Title: Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?
Abstract:
Deep learning (DL) has been increasingly applied in medical imaging, however, it requires large amounts of data, which raises many challenges related to data privacy, storage, and transfer. Federated learning (FL) is a training paradigm that overcomes these issues, though its effectiveness may be reduced when dealing with non-independent and identically distributed (non-IID) data. This study simulates non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, reflecting a common cause of heterogeneity. These subsets are then used for training and testing models for brain tumor segmentation. The findings provide insights into the influence of the MRI intensity normalization methods on segmentation models, both training and inference. Notably, the FL methods demonstrated resilience to inconsistently normalized data across clients, achieving the 3D Dice score of 92%, which is comparable to a centralized model (trained using all data). These results indicate that FL is a solution to effectively train high-performing models without violating data privacy, a crucial concern in medical applications. The code is available at: https://github.com/SanoScience/fl-varying-normalization.
中文:联邦学习在保护数据隐私的同时,有效训练出高性能的脑肿瘤分割模型,对不同MRI强度归一化方法导致的非独立同分布数据表现出良好的适应性。
English: Federated learning effectively trains high-performing brain tumor segmentation models while preserving data privacy, demonstrating resilience to non-IID data conditions caused by varying MRI intensity normalization methods.
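
For context, the kinds of MRI intensity normalization that can induce client heterogeneity typically include variants like the following; the exact set used in the study may differ.

```python
import numpy as np

def zscore_norm(vol, mask=None):
    """Zero mean / unit standard deviation, optionally computed within a brain mask."""
    v = vol[mask] if mask is not None else vol
    return (vol - v.mean()) / (v.std() + 1e-8)

def minmax_norm(vol):
    """Rescale intensities to [0, 1]."""
    return (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)

def percentile_norm(vol, lo=1, hi=99):
    """Clip to robust percentiles before min-max scaling (reduces outlier impact)."""
    p_lo, p_hi = np.percentile(vol, [lo, hi])
    return (np.clip(vol, p_lo, p_hi) - p_lo) / (p_hi - p_lo + 1e-8)

vol = np.random.gamma(2.0, 50.0, size=(16, 16, 16))   # stand-in MRI volume
print(zscore_norm(vol).std(), minmax_norm(vol).max(), percentile_norm(vol).max())
```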

Authors:Fenghe Tang, Chengqi Dong, Wenxin Ma, Zikang Xu, Heqin Zhu, Zihang Jiang, Rongsheng Wang, Yuhao Wang, Chenxu Wu, Shaohua Kevin Zhou
Title: U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking
Abstract:
Over the past decade, U-Net has been the dominant architecture in medical image segmentation, leading to the development of thousands of U-shaped variants. Despite its widespread adoption, there is still no comprehensive benchmark to systematically evaluate their performance and utility, largely because of insufficient statistical validation and limited consideration of efficiency and generalization across diverse datasets. To bridge this gap, we present U-Bench, the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates models along three key dimensions: statistical robustness, zero-shot generalization, and computational efficiency. We introduce a novel metric, U-Score, which jointly captures the performance-efficiency trade-off, offering a deployment-oriented perspective on model progress. (2) Systematic Analysis and Model Selection Guidance: We summarize key findings from the large-scale evaluation and systematically analyze the impact of dataset characteristics and architectural paradigms on model performance. Based on these insights, we propose a model advisor agent to guide researchers in selecting the most suitable models for specific datasets and tasks. (3) Public Availability: We provide all code, models, protocols, and weights, enabling the community to reproduce our results and extend the benchmark with future methods. In summary, U-Bench not only exposes gaps in previous evaluations but also establishes a foundation for fair, reproducible, and practically relevant benchmarking in the next decade of U-Net-based segmentation models. The project can be accessed at: https://fenghetan9.github.io/ubench. Code is available at: https://github.com/FengheTan9/U-Bench.
中文: U-Bench作为首个大规模基准测试,通过28个数据集和10种成像模式系统评估了100种U-Net变体,提出U-Score指标和模型顾问工具指导实际应用,并公开所有资源促进可复现研究。
English: U-Bench is the first comprehensive benchmark that rigorously evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities, introducing the U-Score metric and model advisor to guide practical model selection while making all resources publicly available.
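
The U-Score itself is defined in the benchmark; purely to illustrate what a performance-efficiency trade-off metric can look like, a hypothetical weighted combination is sketched below (this is NOT the paper's U-Score formula).

```python
def tradeoff_score(dice, gflops, params_m, max_gflops=500.0, max_params_m=100.0,
                   w_perf=0.7):
    """Hypothetical score: segmentation accuracy traded off against normalized
    compute (GFLOPs) and parameter count (millions). Weights are arbitrary."""
    efficiency = 1.0 - 0.5 * (min(gflops / max_gflops, 1.0) +
                              min(params_m / max_params_m, 1.0))
    return w_perf * dice + (1.0 - w_perf) * efficiency

# A slightly less accurate but much lighter model can score higher overall.
print(tradeoff_score(dice=0.90, gflops=60.0, params_m=30.0))    # ~0.867
print(tradeoff_score(dice=0.92, gflops=400.0, params_m=90.0))   # ~0.689
```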

Authors:Shaojie Zhang, Ke Chen
Title: Angular Constraint Embedding via SpherePair Loss for Constrained Clustering
Abstract:
Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at https://github.com/spherepaircc/SpherePairCC/tree/main.
Chinese: 提出的SpherePair方法采用角度约束嵌入进行深度约束聚类,有效分离表示学习与聚类过程,无需指定聚类数量即可提升可扩展性和实际应用性。
English: The proposed SpherePair method introduces an angular constraint embedding approach for deep constrained clustering, effectively separating representation learning from clustering to enhance scalability and real-world applicability without requiring the exact number of clusters.
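
A toy angular pairwise-constraint loss on the unit sphere, to illustrate the general idea of encoding must-link/cannot-link constraints in angular space; the actual SpherePair loss has its own geometric formulation and theoretical guarantees.

```python
import torch
import torch.nn.functional as F

def angular_pairwise_loss(z_i, z_j, is_must_link, margin=0.5):
    """Illustrative constraint loss on L2-normalized embeddings: must-link pairs
    maximize cosine similarity, cannot-link pairs are pushed below a margin."""
    z_i, z_j = F.normalize(z_i, dim=-1), F.normalize(z_j, dim=-1)
    cos = (z_i * z_j).sum(-1)
    ml = 1.0 - cos                       # pull must-link pairs together
    cl = F.relu(cos - margin)            # push cannot-link pairs apart beyond the margin
    return torch.where(is_must_link, ml, cl).mean()

z_a, z_b = torch.randn(32, 64), torch.randn(32, 64)
labels = torch.rand(32) < 0.5            # random must-link / cannot-link flags
print(angular_pairwise_loss(z_a, z_b, labels))
```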

Authors:Bouthaina Slika, Fadi Dornaika, Fares Bougourzi, Karim Hammoudi
Title: Lung Infection Severity Prediction Using Transformers with Conditional TransMix Augmentation and Cross-Attention
Abstract:
Lung infections, particularly pneumonia, pose serious health risks that can escalate rapidly, especially during pandemics. Accurate AI-based severity prediction from medical imaging is essential to support timely clinical decisions and optimize patient outcomes. In this work, we present a novel method applicable to both CT scans and chest X-rays for assessing lung infection severity. Our contributions are twofold: (i) QCross-Att-PVT, a Transformer-based architecture that integrates parallel encoders, a cross-gated attention mechanism, and a feature aggregator to capture rich multi-scale features; and (ii) Conditional Online TransMix, a custom data augmentation strategy designed to address dataset imbalance by generating mixed-label image patches during training. Evaluated on two benchmark datasets, RALO CXR and Per-COVID-19 CT, our method consistently outperforms several state-of-the-art deep learning models. The results emphasize the critical role of data augmentation and gated attention in improving both robustness and predictive accuracy. This approach offers a reliable, adaptable tool to support clinical diagnosis, disease monitoring, and personalized treatment planning. The source code of this work is available at https://github.com/bouthainas/QCross-Att-PVT.
Chinese: 本研究提出了一种结合Transformer架构和定制数据增强策略的新型AI方法,能通过CT扫描和X射线准确预测肺部感染严重程度,其性能优于现有模型,并凸显了在临床诊断和治疗中的实用价值。
English: This study introduces a novel AI method combining a Transformer-based architecture with a custom data augmentation strategy to accurately predict lung infection severity from CT scans and X-rays, demonstrating superior performance over existing models and highlighting its clinical utility for diagnosis and treatment.

Authors:Samir Abou Haidar, Alexandre Chariot, Mehdi Darouich, Cyril Joly, Jean-Emmanuel Deschaud
Title: HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation
Abstract:
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24$\times$ faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt
Chinese: HARP-NeXt是一种高速精准的激光雷达语义分割网络,通过创新的预处理方法和多尺度特征融合,在不依赖测试时增强的情况下,实现了卓越的速度与精度平衡,比顶尖方法快24倍。
English: HARP-NeXt is a high-speed and accurate LiDAR semantic segmentation network that introduces a novel pre-processing method and multi-scale feature fusion to achieve a superior speed-accuracy trade-off, running 24 times faster than top methods without relying on test-time augmentation.
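
For context, the conventional spherical range-image projection used by projection-based LiDAR methods can be sketched as below; HARP-NeXt proposes its own, lighter pre-processing, so this only shows the standard baseline step, with sensor field-of-view values assumed.

```python
import numpy as np

def spherical_range_projection(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an HxW range image
    (row = elevation bin, column = azimuth bin), keeping the closest return."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = ((0.5 * (1.0 - yaw / np.pi)) * w).astype(int) % w
    v = np.clip((fov_up_r - pitch) / (fov_up_r - fov_down_r) * h, 0, h - 1).astype(int)
    rng_img = np.full((h, w), -1.0)
    order = np.argsort(-depth)           # write far points first so near ones win
    rng_img[v[order], u[order]] = depth[order]
    return rng_img

pts = np.random.randn(10000, 3) * np.array([20.0, 20.0, 1.5])  # stand-in scan
print(spherical_range_projection(pts).shape)   # (64, 2048)
```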

Authors:Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu
Title: SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Abstract:
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the "Reasoning Tax". Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average scores of 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and more than 10x larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points, respectively, on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our code is available at https://github.com/HarveyYi/SaFeR-VLM.
中文: SaFeR-VLM框架通过强化学习将安全性直接嵌入多模态推理过程,在多个基准测试中实现了安全性和实用性的卓越表现,并超越了更大规模的模型。
English: The SaFeR-VLM framework integrates safety directly into multimodal reasoning through a reinforcement learning approach, achieving superior performance in both safety and helpfulness across benchmarks while surpassing larger models.

Authors:Kanglei Zhou, Qingyi Pan, Xingxing Zhang, Hubert P. H. Shum, Frederick W. B. Li, Xiaohui Liang, Liyuan Wang
Title: Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization
Abstract:
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at https://github.com/ZhouKanglei/MAGRPP.
English Summary: This paper introduces Continual Action Quality Assessment (CAQA) and the Adaptive Manifold-Aligned Graph Regularization method (MAGR++), which pairs controlled full-parameter fine-tuning with a feature rectification pipeline to handle evolving quality distributions while preventing catastrophic forgetting, achieving state-of-the-art performance across multiple benchmarks.
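
A rough sketch of the two-step feature rectification pipeline, under simplifying assumptions: the manifold projector is reduced to a small MLP that translates stored historical features into the current representation space, and the graph regularizer to an alignment of pairwise cosine-similarity graphs. The actual MAGR++ modules may differ.

    import torch
    import torch.nn.functional as F

    class ManifoldProjector(torch.nn.Module):
        # Translates deviated historical features into the current representation space.
        def __init__(self, dim: int):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))

        def forward(self, old_feats: torch.Tensor) -> torch.Tensor:
            return self.net(old_feats)

    def graph_regularizer(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # Align local structure: match the pairwise cosine-similarity graphs of two feature sets.
        ga = F.normalize(feats_a, dim=-1) @ F.normalize(feats_a, dim=-1).T
        gb = F.normalize(feats_b, dim=-1) @ F.normalize(feats_b, dim=-1).T
        return F.mse_loss(ga, gb)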

Authors:Jaeseok Jeong, Junho Kim, Gayoung Lee, Yunjey Choi, Youngjung Uh
Title: StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance
Abstract:
In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and 2) propose negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs a negative score by intentionally simulating content leakage: it swaps the queries, rather than the keys and values, of self-attention layers from the visual style prompt. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as a visual style prompt. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references and ensuring that the resulting images match the text prompts. Our code is available \href{https://github.com/naver-ai/StyleKeeper}{here}.
English Summary: This paper introduces a method combining extended classifier-free guidance with negative visual query guidance to reduce content leakage in text-to-image generation while maintaining precise style control from visual prompts.
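
At the sampler level, NVQG can be read as one extra term in the classifier-free guidance combination. The sketch below shows only that step and assumes eps_neg comes from a denoising pass whose self-attention queries were swapped in from the style image (the simulated content leakage); both guidance weights are illustrative.

    import torch

    def guided_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                     eps_neg: torch.Tensor, w_cond: float = 7.5, w_neg: float = 2.0):
        # Standard CFG pulls toward the condition; the negative visual query term
        # pushes away from the leaked-content prediction.
        return eps_uncond + w_cond * (eps_cond - eps_uncond) - w_neg * (eps_neg - eps_uncond)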

Authors:Yuxi Liu, Yunfeng Ma, Yi Tang, Min Liu, Shuai Jiang, Yaonan Wang
Title: Automated Neural Architecture Design for Industrial Defect Detection
Abstract:
Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at https://github.com/Yuxi104/AutoNAD.
English Summary: AutoNAD is an automated neural architecture design framework that addresses intraclass difference and interclass similarity in industrial surface defect detection by jointly searching over convolutions, transformers, and MLPs, with a latency-aware prior for efficient deployment.
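
The latency-aware prior can be pictured, in its simplest form, as a scalarized trade-off when ranking candidate subnets drawn from the search space. The candidate names, numbers, and coefficient below are made up for illustration.

    def select_architecture(candidates, lam: float = 0.002):
        """Pick the subnet maximizing accuracy minus a latency penalty.
        candidates: list of (name, val_accuracy, latency_ms) tuples."""
        return max(candidates, key=lambda c: c[1] - lam * c[2])

    best = select_architecture([("conv-heavy", 0.91, 42.0),
                                ("hybrid-attn", 0.93, 55.0),
                                ("mlp-light", 0.89, 18.0)])
    print(best)  # ('mlp-light', 0.89, 18.0): lower latency wins at this lambda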

Authors:Tao Feng, Tingfa Xu, Haolin Qin, Tianhao Li, Shuaihao Han, Xuyang Zou, Zhan Lv, Jianan Li
Title: MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking
Abstract:
Visual object tracking in real-world scenarios presents numerous challenges, including occlusion, interference from similar objects, and complex backgrounds, all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes: interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes: spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks, and many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale: 300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling, and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.
English Summary: MSITrack is the largest and most diverse multispectral single object tracking dataset to date, offering challenging attributes, rich natural scenes, and 300 videos with over 129k frames, and it significantly improves tracking performance over RGB-only baselines.

Authors:Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin
Title: SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
Abstract:
The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM
English Summary: This paper introduces the Synthetic Dataset Quality Metric (SDQM), a scalable metric that assesses synthetic data quality for object detection without requiring model training to converge, correlating strongly with downstream mAP and enabling efficient dataset optimization in resource-constrained settings.
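
Since the paper's headline result is the correlation between SDQM and downstream mAP, a sanity check of that claim has a simple shape. The numbers below are invented placeholders for per-dataset SDQM scores and YOLO mAP values.

    from scipy.stats import pearsonr, spearmanr

    sdqm  = [0.62, 0.71, 0.55, 0.80, 0.47]   # hypothetical SDQM per synthetic dataset
    map50 = [0.41, 0.49, 0.33, 0.58, 0.28]   # hypothetical YOLO mAP on real validation data

    r, _ = pearsonr(sdqm, map50)
    rho, _ = spearmanr(sdqm, map50)
    print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}")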

Authors:Tianyue Xu, Yanlin Wu, Abhai K. Tripathi, Matthew M. Ippolito, Benjamin D. Haeffele
Title: Adaptive Stain Normalization for Cross-Domain Medical Histology
Abstract:
Deep learning advances have revolutionized automated digital pathology analysis. However, differences in staining protocols and imaging conditions can introduce significant color variability. In deep learning, such color inconsistency often reduces performance when deploying models on data acquired under different conditions from the training data, a challenge known as domain shift. Many existing methods attempt to address this problem via color normalization but suffer from several notable drawbacks such as introducing artifacts or requiring careful choice of a template image for stain mapping. To address these limitations, we propose a trainable color normalization model that can be integrated with any backbone network for downstream tasks such as object detection and classification. Based on the physics of the imaging process per the Beer-Lambert law, our model architecture is derived via algorithmic unrolling of a nonnegative matrix factorization (NMF) model to extract stain-invariant structural information from the original pathology images, which serves as input for further processing. Experimentally, we evaluate the method on publicly available pathology datasets and an internally curated collection of malaria blood smears for cross-domain object detection and classification, where our method outperforms many state-of-the-art stain normalization methods. Our code is available at https://github.com/xutianyue/BeerLaNet.
English Summary: This paper introduces a trainable color normalization model that integrates with any backbone network to mitigate domain shift in digital pathology, deriving its architecture from the Beer-Lambert law via algorithmic unrolling of NMF and outperforming existing stain normalization methods in cross-domain tasks.
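
The physics underlying the architecture is compact: by the Beer-Lambert law, stains mix approximately linearly in optical-density space, where NMF can separate stain colors from concentrations. A minimal numpy sketch of the OD transform and one classical multiplicative NMF update; the paper unrolls such iterations into trainable network layers.

    import numpy as np

    def optical_density(rgb: np.ndarray, i0: float = 255.0, eps: float = 1e-6) -> np.ndarray:
        # Beer-Lambert absorbance: OD = -log(I / I0).
        return -np.log(np.clip(rgb, eps, None) / i0)

    def nmf_step(V, W, H, eps=1e-9):
        # One Lee-Seung multiplicative update for V ~ W @ H
        # (W: stain color basis, H: per-pixel stain concentrations).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H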

Authors:Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou
Title: Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
Abstract:
Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
English Summary: MingTok introduces a continuous latent-space visual tokenizer whose three-stage architecture reconciles the competing demands of understanding and generation, enabling Ming-UniVision to unify both tasks in a single autoregressive framework with state-of-the-art performance.

Authors:Fei Zhang, Rob Chancia, Josie Clapp, Amirhossein Hassanzadeh, Dimah Dera, Richard MacKenzie, Jan van Aardt
Title: Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation
Abstract:
Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz-rit.github.io/through-the-lidars-eye/.
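The pipeline's first step, projecting TLS points onto a 2D spherical grid, can be sketched as below; the equirectangular resolution and scanner-centered coordinates are assumptions.

    import numpy as np

    def spherical_projection(xyz: np.ndarray, h: int = 512, w: int = 1024):
        """Map scanner-centered 3D points to pixel indices on an equirectangular grid;
        returns (u, v) indices plus range, ready to rasterize into feature maps."""
        r = np.linalg.norm(xyz, axis=1)
        azimuth = np.arctan2(xyz[:, 1], xyz[:, 0])                            # [-pi, pi]
        elevation = np.arcsin(np.clip(xyz[:, 2] / np.maximum(r, 1e-9), -1, 1))
        u = ((azimuth + np.pi) / (2 * np.pi) * (w - 1)).astype(int)
        v = ((np.pi / 2 - elevation) / np.pi * (h - 1)).astype(int)
        return u, v, r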

Authors:Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Kevin Blackburn-Matzen, Matheus Gadelha
Title: SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation
Abstract:
We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision -- from coarse 2D or 3D boxes to pixel-level segmentations and depth -- with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at https://oindrilasaha.github.io/SIGMA-Gen/

Authors:Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu
Title: Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
Abstract:
We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
English Summary: Lumina-DiMOO is an open-source foundational model that employs fully discrete diffusion modeling for efficient multi-modal generation and understanding, achieving state-of-the-art performance across various tasks.

Authors:Xiaochen Zhao, Chengting Yu, Kairong Yu, Lei Liu, Aili Wang
Title: Enhanced Self-Distillation Framework for Efficient Spiking Neural Network Training
Abstract:
Spiking Neural Networks (SNNs) exhibit exceptional energy efficiency on neuromorphic hardware due to their sparse activation patterns. However, conventional training methods based on surrogate gradients and Backpropagation Through Time (BPTT) not only lag behind Artificial Neural Networks (ANNs) in performance, but also incur significant computational and memory overheads that grow linearly with the temporal dimension. To enable high-performance SNN training under limited computational resources, we propose an enhanced self-distillation framework, jointly optimized with rate-based backpropagation. Specifically, the firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality knowledge generated by the model itself is used to optimize substructures through the ANN pathways. Unlike traditional self-distillation paradigms, we observe that low-quality self-generated knowledge may hinder convergence. To address this, we decouple the teacher signal into reliable and unreliable components, ensuring that only reliable knowledge is used to guide the optimization of the model. Extensive experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate that our method reduces training complexity while achieving high-performance SNN training. Our code is available at https://github.com/Intelli-Chip-Lab/enhanced-self-distillation-framework-for-snn.
English Summary: This paper introduces an enhanced self-distillation framework for Spiking Neural Networks that projects firing rates onto lightweight ANN branches and selectively uses only reliable self-generated knowledge, reducing training complexity while achieving high performance.
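
A compact sketch of the reliable-knowledge filtering: the self-generated teacher signal is decoupled by correctness, and only reliable predictions contribute a distillation term, while every sample keeps the supervised loss. This illustrates the mechanism, not the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def reliable_distill_loss(student_logits, teacher_logits, labels, tau: float = 2.0):
        with torch.no_grad():
            # Decouple the teacher signal: trust it only where it is correct.
            reliable = teacher_logits.argmax(dim=-1).eq(labels).float()
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                      F.softmax(teacher_logits / tau, dim=-1),
                      reduction="none").sum(-1)
        return ce + (kd * reliable).mean() * tau * tau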

Authors:Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll
Title: Human3R: Everyone Everywhere All at Once
Abstract:
We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scene ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R, and uses parameter-efficient visual prompt tuning, to strive to preserve CUT3R's rich spatiotemporal priors, while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline and be easily extended for downstream applications. Code available at https://fanegg.github.io/Human3R
English Summary: Human3R is a unified, feed-forward framework that recovers multi-person SMPL-X bodies, dense 3D scenes, and camera trajectories from monocular videos in a single forward pass, eliminating multi-stage dependencies while achieving real-time, state-of-the-art performance across multiple tasks.

Authors:Aditya Prakash, David Forsyth, Saurabh Gupta
Title: Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images
Abstract:
We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and effectiveness of our lifting (42% better) & forecasting (16.4% gain) models, over the best baselines, especially in zero-shot generalization to everyday images.

Authors:Shuo Jiang, Zhuwen Chen, Liaoman Xu, Yanming Zhu, Changmiao Wang, Jiong Zhang, Feiwei Qin, Yifei Chen, Zhu Zhu
Title: Multimodal Feature Prototype Learning for Interpretable and Discriminative Cancer Survival Prediction
Abstract:
Survival analysis plays a vital role in making clinical decisions. However, the models currently in use are often difficult to interpret, which reduces their usefulness in clinical settings. Prototype learning presents a potential solution, yet traditional methods focus on local similarities and static matching, neglecting the broader tumor context and lacking strong semantic alignment with genomic data. To overcome these issues, we introduce an innovative prototype-based multimodal framework, FeatProto, aimed at enhancing cancer survival prediction by addressing significant limitations in current prototype learning methodologies within pathology. Our framework establishes a unified feature prototype space that integrates both global and local features of whole slide images (WSI) with genomic profiles. This integration facilitates traceable and interpretable decision-making processes. Our approach includes three main innovations: (1) A robust phenotype representation that merges critical patches with global context, harmonized with genomic data to minimize local bias. (2) An Exponential Prototype Update Strategy (EMA ProtoUp) that sustains stable cross-modal associations and employs a wandering mechanism to adapt prototypes flexibly to tumor heterogeneity. (3) A hierarchical prototype matching scheme designed to capture global centrality, local typicality, and cohort-level trends, thereby refining prototype inference. Comprehensive evaluations on four publicly available cancer datasets indicate that our method surpasses current leading unimodal and multimodal survival prediction techniques in both accuracy and interpretability, providing a new perspective on prototype learning for critical medical applications. Our source code is available at https://github.com/JSLiam94/FeatProto.
English Summary: The FeatProto framework introduces a prototype-based multimodal learning approach that integrates whole slide images with genomic data to improve cancer survival prediction with enhanced interpretability and accuracy.
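
The EMA ProtoUp step reduces, in its simplest reading, to a momentum update that lets prototypes drift slowly toward the multimodal features currently assigned to them; the momentum value is illustrative and the wandering mechanism is omitted.

    import torch

    def ema_proto_update(prototype: torch.Tensor, assigned_feats: torch.Tensor,
                         momentum: float = 0.99) -> torch.Tensor:
        # Slow drift keeps cross-modal associations stable across training steps.
        return momentum * prototype + (1.0 - momentum) * assigned_feats.mean(dim=0)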

Authors:Yinjian Wang, Wei Li, Yuanyuan Gui, Gemine Vivone
Title: Compact Multi-level-prior Tensor Representation for Hyperspectral Image Super-resolution
Abstract:
Fusing a hyperspectral image with a multispectral image acquired over the same scene, \textit{i.e.}, hyperspectral image super-resolution, has become a popular computational way to access the latent high-spatial-spectral-resolution image. To date, a variety of fusion methods have been proposed, among which the tensor-based ones have demonstrated that multiple priors, such as multidimensional low-rankness and spatial total variation at multiple levels, effectively drive the fusion process. However, existing tensor-based models can only effectively leverage one or two priors at one or two levels, since simultaneously incorporating multi-level priors inevitably increases model complexity. This introduces challenges in both balancing the weights of different priors and optimizing multi-block structures. To address this, we present a novel hyperspectral super-resolution model compactly characterizing these multi-level priors of hyperspectral images within the tensor framework. Firstly, the proposed model decouples the spectral low-rankness and spatial priors by casting the latent high-spatial-spectral-resolution image into spectral subspace and spatial maps via block term decomposition. Secondly, these spatial maps are stacked as the spatial tensor encoding the high-order spatial low-rankness and smoothness priors, which are co-modeled via the proposed non-convex mode-shuffled tensor correlated total variation. Finally, we draw inspiration from the linearized alternating direction method of multipliers to design an efficient algorithm to optimize the resulting model, theoretically proving its Karush-Kuhn-Tucker convergence under mild conditions. Experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm. The code implementation will be available from https://github.com/WongYinJ.
English Summary: This study introduces a tensor-based hyperspectral image super-resolution model that compactly integrates multi-level priors via block term decomposition and a non-convex mode-shuffled tensor correlated total variation, optimized by an efficient algorithm with proven convergence under mild conditions.

Authors:Xinye Cao, Hongcan Guo, Jiawen Qian, Guoshun Nan, Chao Wang, Yuqi Pan, Tianhao Hou, Xiaojuan Wang, Yutong Gao
Title: VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
Abstract:
Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization in reinforcement learning method that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at https://github.com/caoxinye/VideoMiner.
English Summary: VideoMiner tackles hour-long video understanding by iteratively segmenting, captioning, and clustering videos into a hierarchical tree structure, while the tree-based T-GRPO reinforcement learning method guides key-frame grounding for improved accuracy and efficiency.

Authors:Ron Keuth, Paul Kaftan, Mattias P. Heinrich
Title: Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging
Abstract:
The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions. Our code is available on GitHub.
English Summary: MetaFormer's token mixers are comprehensively studied for medical imaging, revealing that low-complexity mixers such as pooling and grouped convolutions suffice for classification, while the local inductive bias of convolutional mixers is essential for segmentation.
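
For concreteness, the two mixers the study ends up favoring are short to write down: a PoolFormer-style pooling mixer and a grouped-convolution mixer. A minimal PyTorch sketch; the surrounding MetaFormer block (norms, channel-MLP, residuals) is omitted.

    import torch.nn as nn

    class PoolMixer(nn.Module):
        # Parameter-free token mixing: with the block's residual add,
        # x + (pool(x) - x) = pool(x).
        def __init__(self, pool_size: int = 3):
            super().__init__()
            self.pool = nn.AvgPool2d(pool_size, stride=1,
                                     padding=pool_size // 2, count_include_pad=False)

        def forward(self, x):            # x: (B, C, H, W)
            return self.pool(x) - x

    class GroupedConvMixer(nn.Module):
        # Local inductive bias at low parameter/runtime cost (depthwise grouping).
        def __init__(self, dim: int, k: int = 3):
            super().__init__()
            self.conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)

        def forward(self, x):
            return self.conv(x)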

Authors:Jiesi Hu, Yanwu Yang, Zhiyu Ye, Jinyan Zhou, Jianfeng Cao, Hanyang Peng, Ting Ma
Title: Efficient Universal Models for Medical Image Segmentation via Weakly Supervised In-Context Learning
Abstract:
Universal models for medical image segmentation, such as interactive and in-context learning (ICL) models, offer strong generalization but require extensive annotations. Interactive models need repeated user prompts for each image, while ICL relies on dense, pixel-level labels. To address this, we propose Weakly Supervised In-Context Learning (WS-ICL), a new ICL paradigm that leverages weak prompts (e.g., bounding boxes or points) instead of dense labels for context. This approach significantly reduces annotation effort by eliminating the need for fine-grained masks and repeated user prompting for all images. We evaluated the proposed WS-ICL model on three held-out benchmarks. Experimental results demonstrate that WS-ICL achieves performance comparable to regular ICL models at a significantly lower annotation cost. In addition, WS-ICL is highly competitive even under the interactive paradigm. These findings establish WS-ICL as a promising step toward more efficient and unified universal models for medical image segmentation. Our code and model are publicly available at https://github.com/jiesihu/Weak-ICL.
English Summary: The proposed Weakly Supervised In-Context Learning (WS-ICL) model reduces annotation burden by using weak prompts such as bounding boxes or points instead of dense labels, achieving performance comparable to regular ICL models at a significantly lower annotation cost across three held-out medical image segmentation benchmarks.

Authors:Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, Jiwen Lu
Title: $\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection
Abstract:
The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D$^3$QE) for autoregressive-generated image detection that exploits the distinctive patterns and the frequency distribution bias of the codebook existing in real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features with quantization error latents. To evaluate our method, we construct a comprehensive dataset termed ARForensics covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of D$^3$QE across different AR models, with robustness to real-world perturbations. Code is available at \href{https://github.com/Zhangyr2022/D3QE}{https://github.com/Zhangyr2022/D3QE}.
English Summary: This paper introduces D$^3$QE, a method for detecting autoregressive-generated images that exploits discrete distribution discrepancies and quantization errors of the VQ codebook, achieving superior accuracy and robustness across diverse AR models.
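
The raw signals D$^3$QE builds on are straightforward to extract: per-token quantization error against the VQ codebook and codebook usage frequencies. A minimal sketch; the discrepancy-aware transformer that fuses these with semantic features is omitted.

    import torch

    def quantization_stats(z: torch.Tensor, codebook: torch.Tensor):
        """z: (N, D) continuous latents; codebook: (K, D) VQ entries.
        Returns quantization-error latents and codebook frequency statistics."""
        d = torch.cdist(z, codebook)                       # (N, K) pairwise distances
        idx = d.argmin(dim=-1)                             # nearest code per token
        err = z - codebook[idx]                            # residual used as a feature
        usage = torch.bincount(idx, minlength=codebook.size(0)).float()
        return err, usage / usage.sum()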

Authors:Johannes Seiffarth, Keitaro Kasahara, Michelle Bund, Benita Lückel, Richard D. Paul, Matthias Pesch, Lennart Witting, Michael Bott, Dietrich Kohlheyer, Katharina Nöh
Title: acia-workflows: Automated Single-cell Imaging Analysis for Scalable and Deep Learning-based Live-cell Imaging Analysis Workflows
Abstract:
Live-cell imaging (LCI) technology enables the detailed spatio-temporal characterization of living cells at the single-cell level, which is critical for advancing research in the life sciences, from biomedical applications to bioprocessing. High-throughput setups with tens to hundreds of parallel cell cultivations offer the potential for robust and reproducible insights. However, these insights are obscured by the large amount of LCI data recorded per experiment. Recent advances in state-of-the-art deep learning methods for cell segmentation and tracking now enable the automated analysis of such large data volumes, offering unprecedented opportunities to systematically study single-cell dynamics. The next key challenge lies in integrating these powerful tools into accessible, flexible, and user-friendly workflows that support routine application in biological research. In this work, we present acia-workflows, a platform that combines three key components: (1) the Automated live-Cell Imaging Analysis (acia) Python library, which supports the modular design of image analysis pipelines offering eight deep learning segmentation and tracking approaches; (2) workflows that assemble the image analysis pipeline, its software dependencies, documentation, and visualizations into a single Jupyter Notebook, leading to accessible, reproducible and scalable analysis workflows; and (3) a collection of application workflows showcasing the analysis and customization capabilities in real-world applications. Specifically, we present three workflows to investigate various types of microfluidic LCI experiments ranging from growth rate comparisons to precise, minute-resolution quantitative analyses of individual cells' dynamic responses to changing oxygen conditions. Our collection of more than ten application workflows is open source and publicly available at https://github.com/JuBiotech/acia-workflows.
English Summary: The acia-workflows platform integrates deep learning tools for automated live-cell imaging analysis into accessible Jupyter Notebook workflows, enabling reproducible studies of single-cell dynamics across diverse microfluidic experiments.

Authors:Hengyang Zhou, Yiwei Wei, Jian Yang, Zhenyu Zhang
Title: Towards Robust and Reliable Multimodal Fake News Detection with Incomplete Modality
Abstract:
Multimodal fake news detection (MFND) has become an urgent task with the surge of multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel, generic, and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning, which compensates for missing modalities by dynamically leveraging complementary information through multiple experts; (2) Incomplete Modality Adapters, which compensate for the missing information by leveraging the new feature distribution; and (3) Modality Missing Learning, which leverages a label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of fake news detection in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.
English Summary: This study introduces MMLNet, a robust multimodal fusion strategy that addresses modality incompleteness in fake news detection through multi-expert collaborative reasoning, incomplete modality adapters, and label-aware contrastive learning, achieving superior performance across three real-world benchmarks.
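
A toy sketch of the multi-expert compensation idea for a missing image modality: the absent modality is masked out, each expert scores the joint features, and a gate mixes expert outputs. Dimensions, expert count, and the gating form are illustrative assumptions, not MMLNet's actual architecture.

    import torch
    import torch.nn as nn

    class MultiExpertFusion(nn.Module):
        def __init__(self, dim: int = 256, n_experts: int = 3):
            super().__init__()
            self.experts = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(n_experts))
            self.gate = nn.Linear(2 * dim, n_experts)

        def forward(self, text, image, image_present):
            image = image * image_present.view(-1, 1)       # zero out a missing modality
            x = torch.cat([text, image], dim=-1)
            w = torch.softmax(self.gate(x), dim=-1)         # (B, E) expert weights
            out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
            return (w.unsqueeze(-1) * out).sum(dim=1)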

Authors:Sven Koehler, Sarah Kaye Mueller, Jonathan Kiekenap, Gerald Greil, Tarique Hussain, Samir Sarikouch, Florian André, Norbert Frey, Sandy Engelhardt
Title: Deformable Image Registration for Self-supervised Cardiac Phase Detection in Multi-View Multi-Disease Cardiac Magnetic Resonance Images
Abstract:
Cardiovascular magnetic resonance (CMR) is the gold standard for assessing cardiac function, but individual cardiac cycles complicate automatic temporal comparison or sub-phase analysis. Accurate cardiac keyframe detection can eliminate this problem. However, automatic methods solely derive end-systole (ES) and end-diastole (ED) frames from left ventricular volume curves, which do not provide a deeper insight into myocardial motion. We propose a self-supervised deep learning method detecting five keyframes in short-axis (SAX) and four-chamber long-axis (4CH) cine CMR. Initially, dense deformable registration fields are derived from the images and used to compute a 1D motion descriptor, which provides valuable insights into global cardiac contraction and relaxation patterns. From these characteristic curves, keyframes are determined using a simple set of rules. The method was independently evaluated for both views using three public, multicentre, multidisease datasets. M&Ms-2 (n=360) dataset was used for training and evaluation, and M&Ms (n=345) and ACDC (n=100) datasets for repeatability control. Furthermore, generalisability to patients with rare congenital heart defects was tested using the German Competence Network (GCN) dataset. Our self-supervised approach achieved improved detection accuracy by 30% - 51% for SAX and 11% - 47% for 4CH in ED and ES, as measured by cyclic frame difference (cFD), compared with the volume-based approach. We can detect ED and ES, as well as three additional keyframes throughout the cardiac cycle with a mean cFD below 1.31 frames for SAX and 1.73 for LAX. Our approach enables temporally aligned inter- and intra-patient analysis of cardiac dynamics, irrespective of cycle or phase lengths. GitHub repository: https://github.com/Cardio-AI/cmr-multi-view-phase-detection.git
English Summary: This study introduces a self-supervised deep learning method that detects five cardiac keyframes using motion descriptors derived from deformable registration, achieving significantly improved accuracy over volume-based approaches and enabling temporally aligned analysis across patients.
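
The core signal is a 1D motion descriptor distilled from dense registration fields, with keyframes then read off the curve by simple rules. A simplified sketch: the paper's rules select five specific keyframes, while here curve extrema stand in for them.

    import numpy as np
    from scipy.signal import argrelextrema

    def motion_descriptor(flows: np.ndarray) -> np.ndarray:
        # flows: (T, H, W, 2) deformable registration fields between consecutive frames.
        # Collapse to mean displacement magnitude per frame (contraction/relaxation curve).
        return np.linalg.norm(flows, axis=-1).mean(axis=(1, 2))

    def candidate_keyframes(curve: np.ndarray) -> np.ndarray:
        # Extrema of the motion curve mark phase transitions such as ES and ED.
        peaks = argrelextrema(curve, np.greater)[0]
        valleys = argrelextrema(curve, np.less)[0]
        return np.sort(np.concatenate([peaks, valleys]))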

Authors:Amirtaha Amanzadi, Zahra Dehghanian, Hamid Beigy, Hamid R. Rabiee
Title: Redefining Generalization in Visual Domains: A Two-Axis Framework for Fake Image Detection with FusionDetect
Abstract:
The rapid development of generative models has made it increasingly crucial to develop detectors that can reliably detect synthetic images. Although most of the work has now focused on cross-generator generalization, we argue that this viewpoint is too limited. Detecting synthetic images involves another equally important challenge: generalization across visual domains. To bridge this gap, we present the OmniGen Benchmark. This comprehensive evaluation dataset incorporates 12 state-of-the-art generators, providing a more realistic way of evaluating detector performance under realistic conditions. In addition, we introduce a new method, FusionDetect, aimed at addressing both vectors of generalization. FusionDetect draws on the benefits of two frozen foundation models: CLIP & Dinov2. By deriving features from both complementary models, we develop a cohesive feature space that naturally adapts to changes in both the content and design of the generator. Our extensive experiments demonstrate that FusionDetect delivers not only a new state-of-the-art, which is 3.87% more accurate than its closest competitor and 6.13% more precise on average on established benchmarks, but also achieves a 4.48% increase in accuracy on OmniGen, along with exceptional robustness to common image perturbations. We introduce not only a top-performing detector, but also a new benchmark and framework for furthering universal AI image detection. The code and dataset are available at http://github.com/amir-aman/FusionDetect
English Summary: The OmniGen Benchmark and the FusionDetect method address the dual challenges of cross-generator and cross-domain generalization in synthetic image detection, achieving state-of-the-art accuracy and robustness.
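
The fusion recipe itself is small: concatenate features from two frozen foundation encoders and train only a light head. In this sketch the encoder wrappers are caller-supplied placeholders, not the real CLIP/DINOv2 APIs, and the feature dimensions are assumptions.

    import torch
    import torch.nn as nn

    class FusionDetectHead(nn.Module):
        def __init__(self, clip_encoder, dino_encoder, clip_dim=768, dino_dim=1024):
            super().__init__()
            self.clip, self.dino = clip_encoder.eval(), dino_encoder.eval()
            for p in list(self.clip.parameters()) + list(self.dino.parameters()):
                p.requires_grad_(False)                    # both backbones stay frozen
            self.head = nn.Linear(clip_dim + dino_dim, 2)  # real vs. synthetic

        def forward(self, images):
            with torch.no_grad():
                feats = torch.cat([self.clip(images), self.dino(images)], dim=-1)
            return self.head(feats)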

Authors:Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
Title: D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Abstract:
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA Toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, at https://worv-ai.github.io/d2e/

Authors:Manolis Mylonas, Charalampia Zerva, Evlampios Apostolidis, Vasileios Mezaris
Title: SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets
Abstract:
In this work, we extend a recent method for script-driven video summarization, originally considering just the visual content of the video, to take into account the relevance of the user-provided script also with the video's spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for video summarization (S-VideoXum, MrHiSum), to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
English Summary: This study introduces SD-MVSum, a script-driven multimodal video summarization method that uses a weighted cross-modal attention mechanism to model script-video and script-transcript relevance, promoting the video parts most aligned with the user-provided script, and extends two large-scale datasets for training and evaluation.
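
One plausible reading of the weighted cross-modal attention: standard attention logits between script queries and video (or transcript) keys are re-weighted by their semantic similarity, so script-relevant parts of the full-length video are promoted. The exact parameterization below is an assumption, not the paper's formula.

    import torch
    import torch.nn.functional as F

    def weighted_cross_attention(q_script, k_video, v_video):
        # q_script: (Lq, D); k_video, v_video: (Lk, D).
        logits = q_script @ k_video.T / q_script.size(-1) ** 0.5
        # Semantic similarity between paired modalities scales the attention logits.
        sim = F.normalize(q_script, dim=-1) @ F.normalize(k_video, dim=-1).T
        attn = F.softmax(logits * (1.0 + sim.clamp(min=0)), dim=-1)
        return attn @ v_video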

Authors:Ibrahim Salihu Yusuf, Iffanice Houndayi, Rym Oualha, Mohamed Aziz Cherif, Kobby Panford-Quainoo, Arnu Pretorius
Title: InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment
Abstract:
Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
English Summary: InstaGeo is an open-source framework that unifies automated geospatial data curation, task-specific model distillation, and web-map deployment, producing compact models for real-time Earth observation that cut computational cost and CO2 emissions with minimal accuracy loss.

Authors:Guangrong Wan, Jun liu, Qiyang Zhou, Tang tang, Lianghao Shi, Wenjun Luo, TingTing Xu
Title: TFM Dataset: A Novel Multi-task Dataset and Integrated Pipeline for Automated Tear Film Break-Up Segmentation
Abstract:
Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but automated TFBU segmentation remains challenging due to the lack of annotated datasets and integrated solutions. This paper introduces the Tear Film Multi-task (TFM) Dataset, the first comprehensive dataset for multi-task tear film analysis, comprising 15 high-resolution videos (totaling 6,247 frames) annotated with three vision tasks: frame-level classification ('clear', 'closed', 'broken', 'blur'), Placido Ring detection, and pixel-wise TFBU area segmentation. Leveraging this dataset, we first propose TF-Net, a novel and efficient baseline segmentation model. TF-Net incorporates a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network to achieve a favorable balance between accuracy and computational efficiency for real-time clinical applications. We further establish benchmark performance on the TFM segmentation subset by comparing TF-Net against several state-of-the-art medical image segmentation models. Furthermore, we design TF-Collab, a novel integrated real-time pipeline that synergistically leverages models trained on all three tasks of the TFM dataset. By sequentially orchestrating frame classification for BUT determination, pupil region localization for input standardization, and TFBU segmentation, TF-Collab fully automates the analysis. Experimental results demonstrate the effectiveness of the proposed TF-Net and TF-Collab, providing a foundation for future research in ocular surface diagnostics. Our code and the TFM datasets are available at https://github.com/glory-wan/TF-Net
English Summary: This paper introduces the TFM Dataset and the TF-Net segmentation model for automated tear film break-up analysis, along with the integrated real-time TF-Collab pipeline that automates dry eye diagnosis through multi-task coordination.

Authors:Junwen Chen, Peilin Xiong, Keiji Yanai
Title: HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection
Abstract:
Recent human-object interaction detection (HOID) methods rely heavily on prior knowledge from VLMs to enhance interaction recognition capabilities. The training strategies and model architectures for connecting the knowledge from VLMs to the HOI instance representations from the object detector are challenging to design, and the whole framework is complex for further development or application. On the other hand, the inherent reasoning abilities of MLLMs for human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and, for the first time, explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task by pure text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.
English Summary: The study introduces HOI-R1, which applies reinforcement learning to a multimodal LLM to perform human-object interaction detection purely through text reasoning, achieving twice the baseline accuracy on HICO-DET without any additional detection modules.
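
A reward for RL over pure text outputs plausibly parses (human box, object box, verb) triplets from the generation and scores them against ground truth. A sketch under that assumption; the paper's actual HOID reward functions may match and weight differently.

    def iou(a, b):
        # Boxes as [x1, y1, x2, y2].
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    def hoid_reward(pred, gold, thr=0.5):
        # pred/gold: lists of {"human": box, "object": box, "verb": str} triplets.
        hits = sum(1 for p in pred for g in gold
                   if p["verb"] == g["verb"]
                   and iou(p["human"], g["human"]) >= thr
                   and iou(p["object"], g["object"]) >= thr)
        return hits / max(len(gold), 1)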

Authors:Bin Kang, Bin Chen, Junjie Wang, Yulin Li, Junzhi Zhao, Zhuotao Tian
Title: CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval
Abstract:
Existing Visual Language Models (VLMs) suffer structural limitations where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce \textbf{CalibCLIP}, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
English Summary: CalibCLIP is a training-free method that calibrates the suppressive effect of dominant tokens in Visual Language Models through visual and textual calibration modules, improving performance across seven text-driven image retrieval benchmarks.
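
The visual-side calibration can be approximated as: find tokens that receive an outsized share of attention mass and scale their representations down so they stop dominating aggregation. A training-free sketch with an illustrative threshold and scale; the actual CVE decoupling into target and low-information regions is richer.

    import torch

    def suppress_dominant_tokens(tokens, attn, q: float = 0.95, alpha: float = 0.5):
        # tokens: (B, N, D); attn: (B, heads, N, N) self-attention maps.
        mass = attn.mean(dim=1).sum(dim=-2)                 # attention received per token
        thresh = torch.quantile(mass, q, dim=-1, keepdim=True)
        dominant = (mass > thresh).unsqueeze(-1)            # (B, N, 1) mask
        return torch.where(dominant, alpha * tokens, tokens)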

Authors:Hongchi Xia, Chih-Hao Lin, Hao-Yu Hsu, Quentin Leboutet, Katelyn Gao, Michael Paulitsch, Benjamin Ummenhofer, Shenlong Wang
Title: HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video
Abstract:
Digitizing the physical world into accurate simulation-ready virtual environments offers significant opportunities in a variety of fields such as augmented and virtual reality, gaming, and robotics. However, current 3D reconstruction and scene-understanding methods commonly fall short in one or more critical aspects, such as geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or realistic physical properties for reliable dynamic simulation. To address these limitations, we introduce HoloScene, a novel interactive 3D reconstruction framework that simultaneously achieves these requirements. HoloScene leverages a comprehensive interactive scene-graph representation, encoding object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. Reconstruction is formulated as an energy-based optimization problem, integrating observational data, physical constraints, and generative priors into a unified, coherent objective. Optimization is efficiently performed via a hybrid approach combining sampling-based exploration with gradient-based refinement. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints. Evaluations conducted on multiple benchmark datasets demonstrate superior performance, while practical use-cases in interactive gaming and real-time digital-twin manipulation illustrate HoloScene's broad applicability and effectiveness. Project page: https://xiahongchi.github.io/HoloScene.

Authors:Yang Xiao, Gen Li, Kaiyuan Deng, Yushu Wu, Zheng Zhan, Yanzhi Wang, Xiaolong Ma, Bo Hui
Title: LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation
Abstract:
Training-free acceleration has emerged as an advanced research area in video generation based on diffusion models. The redundancy of latents in diffusion model inference provides a natural entry point for acceleration. In this paper, we decompose the inference process into the encoding, denoising, and decoding stages, and observe that cache-based acceleration methods often lead to substantial memory surges in the latter two stages. To address this problem, we analyze the characteristics of inference across different stages and propose stage-specific strategies for reducing memory consumption: (1) asynchronous cache swapping, (2) feature chunking, and (3) slicing latents for decoding. At the same time, we ensure that the time overhead introduced by these three strategies remains lower than the acceleration gains themselves. Compared with the baseline, our approach achieves faster inference speed and lower memory usage, while maintaining quality degradation within an acceptable range. The code is available at https://github.com/NKUShaw/LightCache.
中文: 本文提出LightCache方法,通过异步缓存交换、特征分块等阶段优化策略,在无需训练的情况下降低扩散模型视频生成的内存消耗,实现加速推理且保持可接受的画质损失。
English: This paper introduces LightCache, a training-free method that accelerates video generation in diffusion models by reducing memory usage through stage-specific strategies like asynchronous cache swapping and feature chunking, achieving faster inference with minimal quality loss.
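Of the three strategies, slicing latents before decoding is the most self-contained; a minimal sketch is given below, assuming a generic VAE decoder callable (vae_decode is a placeholder, not the LightCache API).

```python
import torch

@torch.no_grad()
def sliced_decode(vae_decode, latents: torch.Tensor, chunk: int = 4) -> torch.Tensor:
    """latents: [B, C, T, H, W]; `vae_decode` maps [B, C, t, H, W] -> [B, 3, t, H', W']."""
    outs = []
    for start in range(0, latents.shape[2], chunk):
        outs.append(vae_decode(latents[:, :, start:start + chunk]))  # decode a few frames at a time
    return torch.cat(outs, dim=2)                                    # reassemble along the time axis
```

Peak decoder memory then scales with the chunk size rather than the full clip length, at the cost of ignoring any cross-chunk interaction the decoder might otherwise exploit.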

Authors:Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
Title: Paper2Video: Automatic Video Generation from Scientific Papers
Abstract:
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2-to-10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement via a novel, effective tree-search-based visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
中文:Paper2Video基准和PaperTalker框架通过自动化多模态内容协调并引入定制化评估指标,解决了学术演示视频制作费时费力的问题,生成的视频比现有方法更具信息保真度和内容丰富性。
English: The Paper2Video benchmark and PaperTalker framework address the labor-intensive creation of academic presentation videos by automating multi-modal content coordination and introducing tailored evaluation metrics, resulting in more faithful and informative videos than existing methods.

Authors:Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu
Title: VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Abstract:
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
Chinese: VChain是一种新颖的框架,通过利用大型多模态模型生成关键帧来指导预训练视频生成器的调整,显著提升了复杂场景下生成视频的质量。
English: VChain is a novel framework that enhances video generation by using large multimodal models to create critical keyframes, which guide the tuning of a pre-trained video generator, significantly improving video quality in complex scenarios.

Authors:Tingting Liao, Chongjian Ge, Guangyi Liu, Hao Li, Yi Zhou
Title: Character Mixing for Video Generation
Abstract:
Imagine Mr. Bean stepping into Tom and Jerry--can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character's identity and behaviors while enabling coherent cross-context interaction. This is difficult because characters may never have coexisted and because mixing styles often causes style delusion, where realistic characters appear cartoonish or vice versa. We introduce a framework that tackles these issues with Cross-Character Embedding (CCE), which learns identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA), which enriches training with synthetic co-existence and mixed-style data. Together, these techniques allow natural interactions between previously uncoexistent characters without losing stylistic fidelity. Experiments on a curated benchmark of cartoons and live-action series with 10 characters show clear improvements in identity preservation, interaction quality, and robustness to style delusion, enabling new forms of generative storytelling.Additional results and videos are available on our project page: https://tingtingliao.github.io/mimix/.

Authors:Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, Daniel Cohen-Or
Title: SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder
Abstract:
Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.
中文: 本文提出一种基于稀疏自编码器的文本嵌入标记级操控方法,实现解耦连续的图像编辑,无需修改扩散过程即可跨模型控制属性强度。
English: This paper introduces a token-level text embedding manipulation method using a Sparse Autoencoder to achieve disentangled and continuous image editing, providing model-agnostic control over attribute strength without altering the diffusion process.
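The core manipulation, moving a token embedding along an SAE-derived direction with a tunable strength, can be sketched as below; treating a decoder row as the direction and the latent_idx lookup are assumptions rather than the released interface.

```python
import torch

def edit_token_embedding(token_emb: torch.Tensor, sae_decoder_weight: torch.Tensor,
                         latent_idx: int, strength: float) -> torch.Tensor:
    """token_emb: [D]; sae_decoder_weight: [num_latents, D]; `latent_idx` selects the
    sparse latent assumed to correspond to the target attribute."""
    direction = sae_decoder_weight[latent_idx]
    direction = direction / direction.norm()
    return token_emb + strength * direction   # strength 0 leaves the prompt unchanged
```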

Authors:Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
Title: Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Abstract:
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
中文: 本综述首次系统分析了视频大型多模态模型的后训练方法,涵盖监督微调、强化学习和测试时扩展三大支柱,旨在提升模型对复杂时空关系和长视频的理解与推理能力。
English: This survey offers the first comprehensive analysis of post-training methods for Video-Large Multimodal Models, focusing on supervised fine-tuning, reinforcement learning, and test-time scaling to enhance their reasoning capabilities for complex video understanding tasks.

Authors:Chi Yan, Dan Xu
Title: Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction
Abstract:
The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ

Authors:KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho
Title: Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction
Abstract:
3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.
Chinese: 本研究通过设计具有对比预训练的判别性物体特征编码器,解决了三维语义场景图预测中的局限性,显著提升了物体分类和关系预测的性能,在3DSSG数据集上超越了现有最优方法。
English: This study addresses the limitations in 3D Semantic Scene Graph prediction by developing a discriminative object feature encoder with contrastive pretraining, which significantly improves both object classification and relationship prediction, outperforming existing methods on the 3DSSG dataset.
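The contrastive pretraining objective is not specified beyond being contrastive; a generic InfoNCE formulation over two augmented views of the object features, shown below, illustrates the kind of loss such a decoupled pretraining stage could use.

```python
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """view_a, view_b: [N, D] features of the same N objects under two augmentations."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature            # [N, N] similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are the positives
```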

Authors:Foivos Paraperas Papantoniou, Stefanos Zafeiriou
Title: ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion
Abstract:
Human-centric generative models designed for AI-driven storytelling must bring together two core capabilities: identity consistency and precise control over human performance. While recent diffusion-based approaches have made significant progress in maintaining facial identity, achieving fine-grained expression control without compromising identity remains challenging. In this work, we present a diffusion-based framework that faithfully reimagines any subject under any particular facial expression. Building on an ID-consistent face foundation model, we adopt a compositional design featuring an expression cross-attention module guided by FLAME blendshape parameters for explicit control. Trained on a diverse mixture of image and video data rich in expressive variation, our adapter generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, overlooked by prior works. In addition, a pluggable Reference Adapter enables expression editing in real images by transferring the appearance from a reference frame during synthesis. Extensive quantitative and qualitative evaluations show that our model outperforms existing methods in tailored and identity-consistent expression generation. Code and models can be found at https://github.com/foivospar/Arc2Face.
中文摘要:本文提出了一种基于扩散模型的框架,通过组合式设计和多样化表情数据训练,在保持身份一致性的同时实现精细面部表情控制,其性能优于现有方法。
English Summary: This paper introduces a diffusion-based framework that achieves precise facial expression control while maintaining identity consistency, outperforming existing methods through compositional design and training on diverse expressive data.

Authors:Habin Lim, Yeongseob Won, Juwon Seo, Gyeong-Moon Park
Title: ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement
Abstract:
In recent years, multi-concept personalization for text-to-image (T2I) diffusion models to represent several subjects in an image has gained much more attention. The main challenge of this task is "concept mixing", where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, in this paper, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Based on our empirical analysis, we found that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference. Code is available at https://github.com/KU-VGI/ConceptSplit
中文: ConceptSplit是一个新颖框架,通过训练中的令牌级值适应和推理中的潜在优化解耦注意力,有效解决了文本到图像扩散模型中多概念个性化的概念混合问题,防止了非预期的概念干扰。
English: ConceptSplit is a novel framework that addresses concept mixing in multi-concept personalization for text-to-image diffusion models through Token-wise Value Adaptation during training and Latent Optimization for Disentangled Attention during inference, effectively preventing unintended concept interference.
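As a rough illustration of adapting only the value projection, the sketch below wraps a frozen cross-attention to_v layer with a low-rank residual; the low-rank parameterization is an assumption, since the abstract only states that the key projection is left untouched.

```python
import torch
import torch.nn as nn

class ValueOnlyAdapter(nn.Module):
    """Wraps a pretrained cross-attention value projection with a trainable
    low-rank residual; the key projection is deliberately left untouched."""
    def __init__(self, to_v: nn.Linear, rank: int = 4):
        super().__init__()
        self.to_v = to_v
        for p in self.to_v.parameters():
            p.requires_grad_(False)                    # keep the pretrained projection frozen
        self.down = nn.Linear(to_v.in_features, rank, bias=False)
        self.up = nn.Linear(rank, to_v.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                 # adapter initially contributes nothing

    def forward(self, text_states: torch.Tensor) -> torch.Tensor:
        return self.to_v(text_states) + self.up(self.down(text_states))
```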

Authors:Zeyi Zhang, Yanju Zhou, Heyuan Yao, Tenglong Ao, Xiaohang Zhan, Libin Liu
Title: Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents
Abstract:
We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. Additionally, we propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals. The output of the agentic system is translated into high-level guidance for the gesture generator, resulting in realistic movement at both the behavioral and motion levels. Furthermore, the agentic system periodically examines the movements of interlocutors and infers their intentions, forming a continuous feedback loop that enables dynamic and responsive interactions between the two participants. User studies and quantitative evaluations show that our model significantly improves the quality of dyadic interactions, producing natural, synchronized nonverbal behaviors.

Authors:Hao Liu, Yunhao Gao, Wei Li, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone
Title: A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification
Abstract:
Deep learning-based methods have achieved significant success in remote sensing Earth observation data analysis. Numerous feature fusion techniques address multimodal remote sensing image classification by integrating global and local features. However, these techniques often struggle to extract structural and detail features from heterogeneous and redundant multimodal images. To introduce frequency-domain learning for modeling key and sparse detail features, this paper presents the spatial-spectral-frequency interaction network (S²Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains. Specifically, we propose a high-frequency sparse enhancement transformer that employs sparse spatial-spectral attention to optimize the parameters of the high-frequency filter. Subsequently, a two-level spatial-frequency fusion strategy is introduced, comprising an adaptive frequency channel module that fuses low-frequency structures with enhanced high-frequency details, and a high-frequency resonance mask that emphasizes sharp edges via phase similarity. In addition, a spatial-spectral attention fusion module further enhances feature extraction at intermediate layers of the network. Experiments on four benchmark multimodal datasets with limited labeled data demonstrate that S²Fin achieves superior classification performance, outperforming state-of-the-art methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.
中文摘要:本文提出空间-光谱-频率交互网络(S²Fin),通过引入频域学习和多域融合策略,有效解决了异质多模态遥感图像中结构与细节特征提取的难题,在多个基准数据集上实现了优于现有方法的分类性能。
English Summary: This paper introduces the spatial-spectral-frequency interaction network (S²Fin) that enhances multimodal remote sensing image classification by integrating frequency domain learning to better extract structural and detail features from heterogeneous data, demonstrating superior performance on benchmark datasets.
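The high-frequency filtering idea can be illustrated with a fixed (non-learned) FFT-domain mask, as sketched below; the actual module uses sparse spatial-spectral attention to set the filter parameters, which this toy version omits.

```python
import torch

def enhance_high_freq(x: torch.Tensor, cutoff: float = 0.25, gain: float = 1.5) -> torch.Tensor:
    """x: [B, C, H, W]. Amplifies frequencies whose normalized radius exceeds `cutoff`."""
    H, W = x.shape[-2:]
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                            torch.linspace(-1, 1, W, device=x.device), indexing="ij")
    radius = torch.sqrt(yy ** 2 + xx ** 2)
    mask = torch.ones_like(radius)
    mask[radius > cutoff] = gain                       # boost only the high-frequency band
    out = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return out.real
```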

Authors:Honglin Liu, Chao Sun, Peng Hu, Yunfan Li, Xi Peng
Title: Conditional Representation Learning for Customized Tasks
Abstract:
Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.
中文: 提出的条件表示学习(CRL)方法通过大语言模型生成描述性文本来构建语义基,并利用视觉语言模型将图像特征投影至该定制空间,无需精细调优即可在特定任务中优于通用表示。
English: The proposed Conditional Representation Learning (CRL) method generates customized representations by constructing semantic bases from descriptive texts using LLMs and projecting image features via VLMs, outperforming universal embeddings in tailored tasks without costly fine-tuning.
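A minimal sketch of the projection step, assuming precomputed CLIP-style embeddings for the image and for the LLM-generated criterion descriptions; reading the similarities as coordinates is one interpretation of "projecting the image representation into this conditional feature space".

```python
import torch
import torch.nn.functional as F

def conditional_representation(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """image_emb: [D] VLM image embedding; text_embs: [K, D] embeddings of the K
    LLM-generated descriptions for the user-specified criterion (assumed given)."""
    img = F.normalize(image_emb, dim=-1)
    basis = F.normalize(text_embs, dim=-1)
    return basis @ img        # [K] coordinates of the image in the conditional space
```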

Authors:Jorge Leonardo Ruiz Williams
Title: Fast Witness Persistence for MRI Volumes via Hybrid Landmarking
Abstract:
We introduce a scalable witness-based persistent homology pipeline for full-brain MRI volumes that couples density-aware landmark selection with a GPU-ready witness filtration. Candidates are scored by a hybrid metric that balances geometric coverage against inverse kernel density, yielding landmark sets that shrink mean pairwise distances by 30-60% over random or density-only baselines while preserving topological features. Benchmarks on BrainWeb, IXI, and synthetic manifolds execute in under ten seconds on a single NVIDIA RTX 4090 GPU, avoiding the combinatorial blow-up of Cech, Vietoris-Rips, and alpha filtrations. The package is distributed on PyPI as whale-tda (installable via pip); source and issues are hosted at https://github.com/jorgeLRW/whale. The release also exposes a fast preset (mri_deep_dive_fast) for exploratory sweeps, and ships with reproducibility-focused scripts and artifacts for drop-in use in medical imaging workflows.
中文: 我们提出了一种可扩展的脑部MRI持续同调分析流程,结合密度感知地标选择与GPU优化过滤,在保持拓扑结构的同时比基线方法减少30-60%距离,并在十秒内完成体积数据处理。
English: We present a scalable persistent homology pipeline for brain MRI analysis that combines density-aware landmark selection with GPU-optimized filtration, achieving 30-60% distance reduction over baselines while preserving topology and processing volumes in under ten seconds.
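One plausible reading of the hybrid landmark score, greedy max-min coverage penalized by inverse kernel density so that landmarks concentrate where the data does, is sketched below; the exact metric, seeding, and bandwidth are assumptions, not the whale-tda implementation.

```python
import numpy as np

def hybrid_landmarks(points: np.ndarray, n_landmarks: int,
                     bandwidth: float = 1.0, lam: float = 0.5) -> list:
    """points: [N, d] array. Returns indices of the selected landmarks."""
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)   # [N, N] pairwise distances
    density = np.exp(-(dists ** 2) / (2 * bandwidth ** 2)).sum(axis=1)   # unnormalized Gaussian KDE
    inv_density = 1.0 / (density + 1e-12)
    selected = [int(np.argmax(density))]                                 # seed in the densest region
    for _ in range(n_landmarks - 1):
        coverage = dists[:, selected].min(axis=1)                        # distance to nearest landmark
        score = coverage / coverage.max() - lam * inv_density / inv_density.max()
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected
```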

Authors:Zijing Hu, Yunze Tong, Fengda Zhang, Junkun Yuan, Jun Xiao, Kun Kuang
Title: Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation
Abstract:
Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation arises from synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models -- a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts. The code repository for this work is available at https://github.com/hu-zijing/AsynDM.
中文总结:该研究提出的异步扩散模型通过为不同像素分配差异化去噪时间步,使提示相关区域能够参考更清晰的上下文信息,从而显著提升了文本到图像的生成对齐效果。
English Summary: The proposed asynchronous diffusion model improves text-to-image alignment by assigning different denoising timesteps to pixels, allowing prompt-related regions to reference clearer context during generation.
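The scheduling idea can be sketched as a per-pixel timestep map in which prompt-related pixels lag behind (stay noisier for longer); the fixed lag and binary mask below are illustrative assumptions, not the paper's dynamic schedule.

```python
import torch

def pixelwise_timesteps(global_step: int, num_steps: int,
                        prompt_mask: torch.Tensor, lag: int = 5) -> torch.Tensor:
    """prompt_mask: [H, W], 1 for prompt-related pixels, 0 elsewhere. Returns an
    integer timestep map in [0, num_steps - 1] (larger t = noisier)."""
    t = torch.full(prompt_mask.shape, global_step, dtype=torch.float32)
    t = t + prompt_mask.float() * lag        # related regions lag behind, i.e. stay noisier
    return t.clamp(0, num_steps - 1).long()
```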

Authors:Baber Jan, Saeed Anwar, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais
Title: SPEGNet: Synergistic Perception-Guided Network for Camouflaged Object Detection
Abstract:
Camouflaged object detection segments objects with intrinsic similarity and edge disruption. Current detection methods rely on accumulated complex components. Each approach adds components such as boundary modules, attention mechanisms, and multi-scale processors independently. This accumulation creates a computational burden without proportional gains. To manage this complexity, they process at reduced resolutions, eliminating fine details essential for camouflage. We present SPEGNet, addressing fragmentation through a unified design. The architecture integrates multi-scale features via channel calibration and spatial enhancement. Boundaries emerge directly from context-rich representations, maintaining semantic-spatial alignment. Progressive refinement implements scale-adaptive edge modulation with peak influence at intermediate resolutions. This design strikes a balance between boundary precision and regional consistency. SPEGNet achieves 0.887 S_α on CAMO, 0.890 on COD10K, and 0.895 on NC4K, with real-time inference speed. Our approach excels across scales, from tiny, intricate objects to large, pattern-similar ones, while handling occlusion and ambiguous boundaries. Code, model weights, and results are available at https://github.com/Baber-Jan/SPEGNet.
Chinese: SPEGNet提出了一种统一架构,通过整合多尺度特征和渐进式优化,在基准数据集上实现了精确的伪装目标检测和实时性能,超越了现有方法。
English: SPEGNet introduces a unified architecture that integrates multi-scale features and progressive refinement to achieve precise camouflaged object detection with real-time performance, outperforming existing methods on benchmark datasets.

Authors:Xuehai He, Shijie Zhou, Thivyanth Venkateswaran, Kaizhi Zheng, Ziyu Wan, Achuta Kadambi, Xin Eric Wang
Title: MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator
Abstract:
World models that support controllable and editable spatiotemporal environments are valuable for robotics, enabling scalable training data, reproducible evaluation, and flexible task design. While recent text-to-video models generate realistic dynamics, they are constrained to 2D views and offer limited interaction. We introduce MorphoSim, a language-guided framework that generates 4D scenes with multi-view consistency and object-level controls. From natural language instructions, MorphoSim produces dynamic environments where objects can be directed, recolored, or removed, and scenes can be observed from arbitrary viewpoints. The framework integrates trajectory-guided generation with feature field distillation, allowing edits to be applied interactively without full re-generation. Experiments show that MorphoSim maintains high scene fidelity while enabling controllability and editability. The code is available at https://github.com/eric-ai-lab/Morph4D.
中文: MorphoSim是一种语言引导的框架,能生成具有多视角一致性和对象级控制的4D场景,支持动态环境交互编辑且无需完全重新生成,同时保持高场景保真度。
English: MorphoSim is a language-guided framework that generates controllable 4D environments with multi-view consistency and object-level editing capabilities, enabling dynamic scene manipulation without full regeneration while maintaining high fidelity.

Authors:Wojciech Górny, Michał Łasica, Alexandros Matsoukas
Title: Adaptive double-phase Rudin--Osher--Fatemi denoising model
Abstract:
We propose a new image denoising model based on a variable-growth total variation regularization of double-phase type with adaptive weight. It is designed to reduce staircasing with respect to the classical Rudin--Osher--Fatemi model, while preserving the edges of the image in a similar fashion. We implement the model and test its performance on synthetic and natural images in 1D and 2D over a range of noise levels.
Chinese: 本研究提出了一种基于自适应权重双相型变增长全变分正则化的新型图像去噪模型,旨在减少阶梯效应并有效保持图像边缘,已通过在不同噪声水平下对合成和自然图像的测试验证其性能。
English: This study introduces a novel image denoising model using a variable-growth total variation regularization with adaptive weighting to minimize staircasing effects and maintain edge preservation, which has been validated through tests on both synthetic and natural images across various noise levels.
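A plausible form of the functional, assuming the classical ROF data term and a double-phase regularizer with adaptive weight a(x); the authors' exact variable-growth exponents and weighting rule may differ.

```latex
% A plausible form of the functional (an assumption based on the abstract, not the
% authors' exact definition): the classical ROF data term with a double-phase
% regularizer whose adaptive weight a(x) >= 0 switches the local growth of the
% gradient penalty between exponent 1 and exponent q > 1.
\min_{u}\;
  \frac{\lambda}{2}\int_{\Omega} (u - f)^{2}\,dx
  \;+\;
  \int_{\Omega} \Bigl( |\nabla u| + a(x)\,|\nabla u|^{q} \Bigr)\,dx,
\qquad q > 1,\quad a(x) \ge 0.
```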

Authors:Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, Alexandre Alahi
Title: RAP: 3D Rasterization Augmented End-to-End Planning
Abstract:
Imitation learning for end-to-end driving trains policies only on expert demonstrations. Once deployed in a closed loop, such policies lack recovery data: small mistakes cannot be corrected and quickly compound into failures. A promising direction is to generate alternative viewpoints and trajectories beyond the logged path. Prior work explores photorealistic digital twins via neural rendering or game engines, but these methods are prohibitively slow and costly, and thus mainly used for evaluation. In this work, we argue that photorealism is unnecessary for training end-to-end planners. What matters is semantic fidelity and scalability: driving depends on geometry and dynamics, not textures or lighting. Motivated by this, we propose 3D Rasterization, which replaces costly rendering with lightweight rasterization of annotated primitives, enabling augmentations such as counterfactual recovery maneuvers and cross-agent view synthesis. To transfer these synthetic views effectively to real-world deployment, we introduce a Raster-to-Real feature-space alignment that bridges the sim-to-real gap. Together, these components form Rasterization Augmented Planning (RAP), a scalable data augmentation pipeline for planning. RAP achieves state-of-the-art closed-loop robustness and long-tail generalization, ranking first on four major benchmarks: NAVSIM v1/v2, Waymo Open Dataset Vision-based E2E Driving, and Bench2Drive. Our results show that lightweight rasterization with feature alignment suffices to scale E2E training, offering a practical alternative to photorealistic rendering. Project page: https://alan-lanfeng.github.io/RAP/.

Authors:Jiarui Ouyang, Yihui Wang, Yihang Gao, Yingxue Xu, Shu Yang, Hao Chen
Title: GenAR: Next-Scale Autoregressive Generation for Spatial Gene Expression Prediction
Abstract:
Spatial Transcriptomics (ST) offers spatially resolved gene expression but remains costly. Predicting expression directly from widely available Hematoxylin and Eosin (H&E) stained images presents a cost-effective alternative. However, most computational approaches (i) predict each gene independently, overlooking co-expression structure, and (ii) cast the task as continuous regression despite expression being discrete counts. This mismatch can yield biologically implausible outputs and complicate downstream analyses. We introduce GenAR, a multi-scale autoregressive framework that refines predictions from coarse to fine. GenAR clusters genes into hierarchical groups to expose cross-gene dependencies, models expression as codebook-free discrete token generation to directly predict raw counts, and conditions decoding on fused histological and spatial embeddings. From an information-theoretic perspective, the discrete formulation avoids log-induced biases and the coarse-to-fine factorization aligns with a principled conditional decomposition. Extensive experimental results on four Spatial Transcriptomics datasets across different tissue types demonstrate that GenAR achieves state-of-the-art performance, offering potential implications for precision medicine and cost-effective molecular profiling. Code is publicly available at https://github.com/oyjr/genar.
中文: GenAR是一种创新的多尺度自回归框架,通过建模基因间依赖关系并融合组织学与空间嵌入特征,直接从H&E图像预测离散基因表达计数,在多种组织类型上实现最优性能,同时解决了连续回归方法的局限性。
English: GenAR is a novel multi-scale autoregressive framework that predicts discrete gene expression counts directly from H&E images by modeling cross-gene dependencies and using fused histological-spatial embeddings, achieving state-of-the-art performance across multiple tissue types while addressing limitations of continuous regression approaches.

Authors:Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi
Title: World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge
Abstract:
While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward an accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency in less than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.
中文: World-To-Image框架通过智能体检索网络知识补充模型未知概念,采用多模态提示优化技术,在语义准确性和视觉美感上显著超越现有方法。
English: The World-To-Image framework enhances text-to-image models by using an agent to retrieve web-based knowledge for unknown concepts, significantly improving semantic accuracy and visual quality through multimodal prompt optimization.

Authors:Xinglong Luo, Ao Luo, Kunming Luo, Zhengning Wang, Ping Tan, Bing Zeng, Shuaicheng Liu
Title: Learning Efficient Meshflow and Optical Flow from Event Cameras
Abstract:
In this paper, we explore the problem of event-based meshflow estimation, a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start, we review the state-of-the-art in event-based flow estimation, highlighting two key areas for further research: i) the lack of meshflow-specific event datasets and methods, and ii) the underexplored challenge of event data density. First, we generate a large-scale High-Resolution Event Meshflow (HREM) dataset, which showcases its superiority by encompassing the merits of high resolution at 1280x720, handling dynamic objects and complex motion patterns, and offering both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides, we propose Efficient Event-based MeshFlow (EEMFlow) network, a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore, we upgrade EEMFlow network to support dense event optical flow, in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (30x faster) of our EEMFlow model compared to the recent state-of-the-art flow method. As an extension, we expand HREM into HREM+, a multi-density event dataset contributing to a thorough study of the robustness of existing methods across data with varying densities, and propose an Adaptive Density Module (ADM) to adjust the density of input event data to a more optimal range, enhancing the model's generalization ability. We empirically demonstrate that ADM helps to significantly improve the performance of EEMFlow and EEMFlow+ by 8% and 10%, respectively. Code and dataset are released at https://github.com/boomluo02/EEMFlowPlus.
中文摘要:本文提出了基于事件相机的新型网格流估计任务,通过构建HREM数据集和开发EEMFlow网络,实现了30倍加速的卓越性能,并利用自适应密度模块显著提升了模型泛化能力。
English Summary: This paper introduces a novel event-based meshflow estimation task, presenting the HREM dataset and EEMFlow network that achieve superior performance with 30x faster processing and enhanced generalization through adaptive density adjustment.

Authors:Zheng Chen, Kewei Zhang, Xiaoyang Liu, Weihang Zhang, Mengfan Wang, Yifan Fu, Yulun Zhang
Title: QuantDemoire: Quantization with Outlier Aware for Image Demoiréing
Abstract:
Demoiréing aims to remove moiré artifacts that often occur in images. While recent deep learning-based methods have achieved promising results, they typically require substantial computational resources, limiting their deployment on edge devices. Model quantization offers a compelling solution. However, directly applying existing quantization methods to demoiréing models introduces severe performance degradation. The main reasons are distribution outliers and weakened representations in smooth regions. To address these issues, we propose QuantDemoire, a post-training quantization framework tailored to demoiréing. It contains two key components. **First**, we introduce an outlier-aware quantizer to reduce errors from outliers. It uses sampling-based range estimation to reduce activation outliers, and keeps a few extreme weights in FP16 with negligible cost. **Second**, we design a frequency-aware calibration strategy. It emphasizes low- and mid-frequency components during fine-tuning, which mitigates banding artifacts caused by low-bit quantization. Extensive experiments validate that our QuantDemoire achieves large reductions in parameters and computation while maintaining quality. Meanwhile, it outperforms existing quantization methods by over **4 dB** on W4A4. Code is released at: https://github.com/zhengchen1999/QuantDemoire.
中文:QuantDemoire是一种专为去摩尔纹设计的训练后量化框架,通过处理分布异常值和增强频率表示,在显著降低计算需求的同时保持高质量的去纹效果。
English: QuantDemoire is a post-training quantization framework designed to efficiently remove moiré patterns from images by addressing distribution outliers and enhancing frequency representations, significantly reducing computational demands while maintaining high performance.
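The outlier-aware weight quantizer can be sketched generically: keep a small fraction of extreme weights in FP16 and quantize the remainder to low-bit integers. The top-k threshold and symmetric per-tensor scaling below are assumptions, not the released kernels.

```python
import torch

def outlier_aware_quantize(w: torch.Tensor, bits: int = 4, outlier_frac: float = 0.001):
    """Returns (q, scale, mask, outliers): low-bit codes for the bulk of `w` plus the
    FP16 values and positions of the retained extreme weights."""
    k = max(1, int(w.numel() * outlier_frac))
    thresh = w.abs().flatten().topk(k).values.min()
    mask = w.abs() >= thresh                              # positions kept in FP16
    inliers = torch.where(mask, torch.zeros_like(w), w)
    qmax = 2 ** (bits - 1) - 1
    scale = inliers.abs().max() / qmax
    q = torch.clamp(torch.round(inliers / scale), -qmax - 1, qmax)
    return q, scale, mask, w[mask].half()

def dequantize(q, scale, mask, outliers):
    w = q * scale
    w[mask] = outliers.float()                            # splice the FP16 outliers back in
    return w
```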

Authors:Bingtao Yang, Yujia Wang, Mengzhi Jiao, Hongwei Huo
Title: Quantization Range Estimation for Convolutional Neural Networks
Abstract:
Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We formulate range estimation as an optimization problem of minimizing quantization errors via layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We further propose applying the search algorithm in the transformed weight space for additional improvement in practice. Our experiments demonstrate that our method generally outperforms state-of-the-art methods in top-1 accuracy for image classification on the ResNet series models and the Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classification, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.
中文: 本文提出了一种通过逐层局部最小值优化来最小化量化误差的范围估计方法,显著提升了训练后量化性能,在图像分类模型的8位和6位量化中几乎无精度损失,并大幅改善了4位量化的准确性。
English: This paper introduces a range estimation method that minimizes quantization errors through layer-wise optimization, significantly enhancing post-training quantization performance with minimal accuracy loss in 8-bit and 6-bit settings and notable improvements in 4-bit quantization for image classification models.
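A minimal sketch of range estimation by direct search, assuming symmetric uniform quantization and a brute-force grid rather than the paper's locally convex solver.

```python
import torch

def search_clip_range(w: torch.Tensor, bits: int = 4, num_candidates: int = 100) -> float:
    """Returns the symmetric clipping range that minimizes quantization MSE for `w`."""
    qmax = 2 ** (bits - 1) - 1
    best_r, best_err = w.abs().max().item(), float("inf")
    for frac in torch.linspace(0.5, 1.0, num_candidates):
        r = frac * w.abs().max()
        scale = r / qmax
        deq = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = torch.mean((w - deq) ** 2).item()
        if err < best_err:
            best_r, best_err = r.item(), err
    return best_r
```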

Authors:Kushal Vyas, Ashok Veeraraghavan, Guha Balakrishnan
Title: Fit Pixels, Get Labels: Meta-learned Implicit Networks for Image Segmentation
Abstract:
Implicit neural representations (INRs) have achieved remarkable successes in learning expressive yet compact signal representations. However, they are not naturally amenable to predictive tasks such as segmentation, where they must learn semantic structures over a distribution of signals. In this study, we introduce MetaSeg, a meta-learning framework to train INRs for medical image segmentation. MetaSeg uses an underlying INR that simultaneously predicts per-pixel intensity values and class labels. It then uses a meta-learning procedure to find optimal initial parameters for this INR over a training dataset of images and segmentation maps, such that the INR can simply be fine-tuned to fit pixels of an unseen test image, and automatically decode its class labels. We evaluated MetaSeg on 2D and 3D brain MRI segmentation tasks and report Dice scores comparable to commonly used U-Net models, but with 90% fewer parameters. MetaSeg offers a fresh, scalable alternative to traditional resource-heavy architectures such as U-Nets and vision transformers for medical image segmentation. Our project is available at https://kushalvyas.github.io/metaseg.html.

Authors:Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou
Title: Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning
Abstract:
Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to 15.97% in accuracy. The code is available at https://github.com/yaxinhou/CPG.
Chinese: 本文提出的可控伪标签生成(CPG)框架通过动态筛选可靠伪标签扩展标注数据集以维持已知分布,使模型训练不受未标注数据任意分布的影响,在多个基准数据集上实现了最高15.97%的准确率提升。
English: This paper introduces the Controllable Pseudo-label Generation (CPG) framework, which dynamically expands labeled data with reliable pseudo-labels to maintain a known distribution and trains models unaffected by arbitrary unlabeled data distributions, achieving up to 15.97% accuracy improvement over state-of-the-art methods.
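Step (ii) uses standard logit adjustment, which can be sketched directly; the surrounding pseudo-label filtering loop is omitted and the temperature tau is an assumption.

```python
import torch

def logit_adjusted_scores(logits: torch.Tensor, class_counts: torch.Tensor,
                          tau: float = 1.0) -> torch.Tensor:
    """logits: [B, C]; class_counts: [C] class counts of the current labeled set."""
    prior = class_counts.float() / class_counts.sum()
    return logits - tau * torch.log(prior + 1e-12)   # post-hoc Bayes-consistent correction
```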

Authors:Sameep Vani, Shreyas Jena, Maitreya Patel, Chitta Baral, Somak Aditya, Yezhou Yang
Title: Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs
Abstract:
While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks-particularly in video captioning and descriptive tasks-they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading these models to rely heavily on language-based reasoning rather than truly understanding video dynamics. In this work, we propose TimeWarp, a systematic method to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video. We introduce a large-scale preference dataset, created using TimeWarp, that captures intricate temporal dynamics often overlooked, grounding the model's responses to visual and temporal information. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks, highlighting the effectiveness of our proposed datasets in advancing temporal understanding in Video-LLMs, resulting in an absolute improvement in performance across seven benchmarks. Code is available at https://github.com/sameepv21/timewarp.
Chinese: 视频大语言模型在通用视频理解中表现出色,但在细粒度时序理解上表现不佳,因此我们提出了TimeWarp方法创建针对性合成数据集,显著提升了七个基准测试中的时序理解性能。
English: Video-LLMs excel in general video tasks but struggle with fine-grained temporal understanding due to insufficiently complex training data, prompting the development of TimeWarp, a synthetic dataset method that significantly enhances temporal reasoning across multiple benchmarks.

Authors:Hyelin Nam, Hyojun Go, Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung
Title: Generating Human Motion Videos using a Cascaded Text-to-Video Framework
Abstract:
Human video generation is becoming an increasingly important task with broad applications in graphics, entertainment, and embodied AI. Despite the rapid progress of video diffusion models (VDMs), their use for general-purpose human video generation remains underexplored, with most works constrained to image-to-video setups or narrow domains like dance videos. In this work, we propose CAMEO, a cascaded framework for general human motion video generation. It seamlessly bridges Text-to-Motion (T2M) models and conditional VDMs, mitigating suboptimal factors that may arise in this process across both training and inference through carefully designed components. Specifically, we analyze and prepare both textual prompts and visual conditions to effectively train the VDM, ensuring robust alignment between motion descriptions, conditioning signals, and the generated videos. Furthermore, we introduce a camera-aware conditioning module that connects the two stages, automatically selecting viewpoints aligned with the input text to enhance coherence and reduce manual intervention. We demonstrate the effectiveness of our approach on both the MovieGen benchmark and a newly introduced benchmark tailored to the T2M-VDM combination, while highlighting its versatility across diverse use cases.

Authors:Md. Atabuzzaman, Andrew Zhang, Chris Thomas
Title: Zero-Shot Fine-Grained Image Classification Using Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on vision-language reasoning tasks. However, their potential for zero-shot fine-grained image classification, a challenging task requiring precise differentiation between visually similar categories, remains underexplored. We present a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework, leveraging LVLMs' comprehensive understanding capabilities rather than relying on direct class name generation. We enhance model performance through a novel attention intervention technique. We also address a key limitation in existing datasets by developing more comprehensive and precise class description benchmarks. We validate the effectiveness of our method through extensive experimentation across multiple fine-grained image classification benchmarks. Our proposed method consistently outperforms the current state-of-the-art (SOTA) approach, demonstrating both the effectiveness of our method and the broader potential of LVLMs for zero-shot fine-grained classification tasks. Code and Datasets: https://github.com/Atabuzzaman/Fine-grained-classification
Chinese: 本研究提出了一种创新方法,将零样本细粒度图像分类转化为视觉问答框架,利用大型视觉语言模型并通过注意力干预技术和改进的基准数据集增强性能,在多个基准测试中持续超越现有最优方法。
English: This study introduces a novel method that transforms zero-shot fine-grained image classification into a visual question-answering framework using Large Vision-Language Models, enhanced by attention intervention and improved benchmarks, consistently outperforming current state-of-the-art approaches.

Authors:Minseo Lee, Byeonghyeon Lee, Lucas Yunkyu Lee, Eunsoo Lee, Sangmin Kim, Seunghyeon Song, Joo Chan Lee, Jong Hwan Ko, Jaesik Park, Eunbyung Park
Title: Optimized Minimal 4D Gaussian Splatting
Abstract:
4D Gaussian Splatting has emerged as a new paradigm for dynamic scene representation, enabling real-time rendering of scenes with complex motions. However, it faces a major challenge of storage overhead, as millions of Gaussians are required for high-fidelity reconstruction. While several studies have attempted to alleviate this memory burden, they still face limitations in compression ratio or visual quality. In this work, we present OMG4 (Optimized Minimal 4D Gaussian Splatting), a framework that constructs a compact set of salient Gaussians capable of faithfully representing 4D Gaussian models. Our method progressively prunes Gaussians in three stages: (1) Gaussian Sampling to identify primitives critical to reconstruction fidelity, (2) Gaussian Pruning to remove redundancies, and (3) Gaussian Merging to fuse primitives with similar characteristics. In addition, we integrate implicit appearance compression and generalize Sub-Vector Quantization (SVQ) to 4D representations, further reducing storage while preserving quality. Extensive experiments on standard benchmark datasets demonstrate that OMG4 significantly outperforms recent state-of-the-art methods, reducing model sizes by over 60% while maintaining reconstruction quality. These results position OMG4 as a significant step forward in compact 4D scene representation, opening new possibilities for a wide range of applications. Our source code is available at https://minshirley.github.io/OMG4/.

Authors:Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou
Title: UGround: Towards Unified Visual Grounding with Unrolled Transformers
Abstract:
We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "segmentation token as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of the segmentation token as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each segmentation token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the segmentation token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, unify visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
中文: UGround提出了一种统一的视觉定位范式,通过动态选择Transformer中间层作为掩码提示,解决了现有方法依赖固定输出层和缺乏显式空间线索的问题。
English: UGround introduces a unified visual grounding paradigm that dynamically selects transformer layers for mask prompts, addressing the limitations of fixed-layer reliance and implicit spatial cues in existing methods.

Authors:Shuoyan Wei, Feng Li, Shengeng Tang, Runmin Cong, Yao Zhao, Meng Wang, Huihui Bai
Title: Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events
Abstract:
Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. However, prevailing methods often generalize poorly, producing unsatisfactory results when applied to out-of-distribution (OOD) scales. To overcome this limitation, we present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams to achieve robust and generalizable C-STVSR. Our approach incorporates event-adapted synthesis that capitalizes on the spatiotemporal correlations between frames and events to capture long-term motion trajectories, enabling adaptive interpolation and fusion across space and time. This is then coupled with a local implicit video transformer that integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations and generate plausible videos at arbitrary resolutions and frame rates. We further develop EvEnhancerPlus, which builds a controllable switching mechanism that dynamically determines the reconstruction difficulty for each spatiotemporal pixel based on local event statistics. This allows the model to adaptively route reconstruction along the most suitable pathways at a fine-grained pixel level, substantially reducing computational overhead while maintaining excellent performance. Furthermore, we devise a cross-derivative training strategy that stabilizes the convergence of such a multi-pathway framework through staged cross-optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining superior generalizability at OOD scales. The code is available at https://github.com/W-Shuoyan/EvEnhancerPlus.
中文: EvEnhancer是一种新颖方法,利用事件流实现鲁棒的连续时空视频超分辨率,通过事件适配合成和局部隐式视频变换器实现跨任意尺度的自适应插值与融合;而EvEnhancerPlus则增加了可控切换机制,在保持优异性能的同时显著降低计算开销。
English: EvEnhancer is a novel method that leverages event streams to achieve robust continuous space-time video super-resolution, incorporating event-adapted synthesis and a local implicit video transformer for adaptive interpolation and fusion across arbitrary scales, while EvEnhancerPlus adds a controllable switching mechanism to reduce computational costs without sacrificing performance.

Authors:Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun
Title: LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Abstract:
LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.
中文: LIBERO基准在评估视觉-语言-动作模型时存在缺陷,导致性能评估虚高,因此我们提出LIBERO-PRO,在合理扰动下测试模型,发现它们依赖记忆而非真正理解。
English: The LIBERO benchmark for Vision-Language-Action models is flawed, leading to inflated performance, so we introduce LIBERO-PRO to evaluate models under realistic perturbations, revealing their reliance on memorization rather than true understanding.
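
To make the four perturbation dimensions concrete, here is a hypothetical sketch of how a single task instance might be perturbed; the dataclass, field names, and object list are illustrative, not the benchmark's actual API.

```python
# Hypothetical perturbation config spanning the four evaluated dimensions:
# manipulated objects, initial states, task instructions, and environments.
import random
from dataclasses import dataclass

@dataclass
class Perturbation:
    swap_object: bool = True         # replace the target with an irrelevant item
    init_jitter: float = 0.05        # random displacement of the initial state
    paraphrase_instruction: bool = True
    swap_environment: bool = False   # evaluate in a different scene

def perturb_task(task, cfg, rng=random.Random(0)):
    task = dict(task)
    if cfg.swap_object:
        task["object"] = rng.choice(["mug", "book", "bowl"])
    task["init_xy"] = [c + rng.uniform(-cfg.init_jitter, cfg.init_jitter)
                       for c in task["init_xy"]]
    if cfg.paraphrase_instruction:
        task["instruction"] = "could you " + task["instruction"] + "?"
    if cfg.swap_environment:
        task["scene"] = "kitchen_alt"
    return task

base = {"object": "plate", "init_xy": [0.2, 0.4], "scene": "kitchen",
        "instruction": "put the plate on the shelf"}
print(perturb_task(base, Perturbation()))
```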

Authors:Yiheng Dong, Yi Lin, Xin Yang
Title: CoPA: Hierarchical Concept Prompting and Aggregating Network for Explainable Diagnosis
Abstract:
The transparency of deep learning models is essential for clinical diagnostics. The Concept Bottleneck Model provides a clear decision-making process for diagnosis by transforming the latent space of black-box models into human-understandable concepts. However, concept-based methods still face challenges in concept capture. They often rely on features encoded solely from the final layer, neglecting shallow and multiscale features, and lack effective guidance in concept encoding, which hinders fine-grained concept extraction. To address these issues, we introduce Concept Prompting and Aggregating (CoPA), a novel framework designed to capture multilayer concepts under prompt guidance. This framework utilizes the Concept-aware Embedding Generator (CEG) to extract concept representations from each layer of the visual encoder. Simultaneously, these representations serve as prompts for Concept Prompt Tuning (CPT), steering the model towards amplifying critical concept-related visual cues. Visual representations from each layer are aggregated to align with textual concept representations. With the proposed method, valuable concept-wise information in the images is captured and utilized effectively, thus improving the performance of concept and disease prediction. Extensive experimental results demonstrate that CoPA outperforms state-of-the-art methods on three public datasets. Code is available at https://github.com/yihengd/CoPA.
中文: CoPA框架通过提示引导的多层概念聚合,有效提升了临床诊断的透明度和疾病预测性能,在多个公开数据集上表现优于现有先进方法。
English: The CoPA framework enhances clinical diagnostic transparency by capturing multilayer concepts through prompt-guided aggregation, improving both concept and disease prediction accuracy across multiple datasets.
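
A loose sketch of the multilayer aggregation idea, assuming mean pooling per layer and cosine alignment with textual concept vectors; CoPA's actual CEG and CPT modules are more elaborate.

```python
# Schematic of multilayer concept aggregation: per-layer visual features are
# pooled, aggregated across layers, and aligned with textual concept vectors.
import torch
import torch.nn.functional as F

def aggregate_and_align(layer_feats, text_concepts):
    # layer_feats: list of (n_patches, d) features, one per encoder layer
    # text_concepts: (n_concepts, d) textual concept embeddings
    per_layer = torch.stack([f.mean(dim=0) for f in layer_feats])  # (L, d)
    visual = per_layer.mean(dim=0)                                 # aggregated view
    logits = F.normalize(text_concepts, dim=-1) @ F.normalize(visual, dim=-1)
    return logits                                                  # concept scores

scores = aggregate_and_align([torch.randn(196, 512) for _ in range(4)],
                             torch.randn(10, 512))
print(scores.shape)  # torch.Size([10])
```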

Authors:Jiaxin Deng, Junbiao Pang
Title: Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization
Abstract:
Sharpness-Aware Minimization (SAM) improves model generalization but doubles the computational cost of Stochastic Gradient Descent (SGD) by requiring two gradient computations per optimization step. To mitigate this, we propose ARSAM, which Adaptively samples, Reuses, and mixes decomposed gradients to significantly accelerate SAM. Concretely, we first show that SAM's gradient can be decomposed into the SGD gradient and the Projection of the Second-order gradient onto the First-order gradient (PSF). Furthermore, we observe that the SGD gradient and the PSF evolve dynamically during training, with the PSF playing a growing role in reaching a flat minimum. ARSAM therefore reuses the PSF between timely updates, which we show preserves the model's generalization ability. Extensive experiments show that ARSAM achieves state-of-the-art accuracies, comparable to SAM, across diverse network architectures. On CIFAR-10/100, ARSAM matches SAM while providing a speedup of about 40\%. Moreover, ARSAM accelerates optimization on various challenging tasks (\textit{e.g.}, human pose estimation and model quantization) without sacrificing performance, demonstrating its broad practicality. The code is publicly accessible at: https://github.com/ajiaaa/ARSAM.
中文: ARSAM通过自适应重用分解梯度来加速锐度感知最小化,在CIFAR数据集上保持与SAM相当的准确率同时实现约40%的速度提升。
English: ARSAM accelerates Sharpness-Aware Minimization by adaptively reusing decomposed gradients, maintaining comparable accuracy to SAM while achieving about 40% speedup on CIFAR datasets.
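
A minimal sketch of the decomposition-and-reuse idea, assuming a plain SGD update, a fixed reuse interval, and the standard SAM perturbation radius rho; the adaptive sampling schedule of the actual method is not reproduced.

```python
# Sketch of ARSAM's core idea: decompose the SAM gradient into the SGD gradient
# plus a PSF term, then reuse the cached PSF for several steps to skip the
# second gradient computation most of the time.
import copy
import torch

def grad_vector(model, loss_fn, x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return [p.grad.detach().clone() for p in model.parameters()]

def arsam_step(model, loss_fn, x, y, state, rho=0.05, lr=0.1, reuse_every=3):
    g_sgd = grad_vector(model, loss_fn, x, y)          # first gradient (always)
    if state["step"] % reuse_every == 0:               # refresh the PSF term
        norm = torch.sqrt(sum((g ** 2).sum() for g in g_sgd))
        perturbed = copy.deepcopy(model)               # w + rho * g / ||g||
        with torch.no_grad():
            for p, g in zip(perturbed.parameters(), g_sgd):
                p.add_(rho * g / (norm + 1e-12))
        g_sam = grad_vector(perturbed, loss_fn, x, y)  # second gradient (rare)
        state["psf"] = [gs - g for gs, g in zip(g_sam, g_sgd)]
    with torch.no_grad():                              # SGD gradient + cached PSF
        for p, g, psf in zip(model.parameters(), g_sgd, state["psf"]):
            p.add_(-lr * (g + psf))
    state["step"] += 1

model = torch.nn.Linear(4, 2)
state = {"step": 0, "psf": None}
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
for _ in range(4):
    arsam_step(model, torch.nn.functional.cross_entropy, x, y, state)
print(state["step"])  # 4; the second gradient was computed only on steps 0 and 3
```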

Authors:Zuomin Qu, Yimao Guo, Qianyue Hu, Wei Lu
Title: LoRA Patching: Exposing the Fragility of Proactive Defenses against Deepfakes
Abstract:
Deepfakes pose significant societal risks, motivating the development of proactive defenses that embed adversarial perturbations in facial images to prevent manipulation. However, in this paper, we show that these preemptive defenses often lack robustness and reliability. We propose a novel approach, Low-Rank Adaptation (LoRA) patching, which injects a plug-and-play LoRA patch into Deepfake generators to bypass state-of-the-art defenses. A learnable gating mechanism adaptively controls the effect of the LoRA patch and prevents gradient explosions during fine-tuning. We also introduce a Multi-Modal Feature Alignment (MMFA) loss, encouraging the features of adversarial outputs to align with those of the desired outputs at the semantic level. Beyond bypassing, we present defensive LoRA patching, embedding visible warnings in the outputs as a complementary solution to mitigate this newly identified security vulnerability. With only 1,000 facial examples and a single epoch of fine-tuning, LoRA patching successfully defeats multiple proactive defenses. These results reveal a critical weakness in current paradigms and underscore the need for more robust Deepfake defense strategies. Our code is available at https://github.com/ZOMIN28/LoRA-Patching.
中文: 本文揭示了当前主动式深度伪造防御的脆弱性,提出了一种新型LoRA修补方法,既能有效绕过现有防御机制,又提供了防御性版本以应对这一新安全漏洞。
English: This paper reveals the vulnerability of current proactive Deepfake defenses and introduces a novel LoRA patching method that effectively bypasses them while also proposing a defensive version to mitigate this new security threat.
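
A minimal gated LoRA layer in the spirit of the paper: a frozen base projection plus a low-rank update scaled by a learnable gate. The rank, gate parameterization, and initialization are assumptions for illustration.

```python
# Gated LoRA patch: the low-rank update starts as a no-op and its contribution
# is scaled by a learnable, bounded gate (which also damps gradient spikes).
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # frozen generator weights
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # patch starts as a no-op
        self.gate = nn.Parameter(torch.zeros(1))  # learnable gating scalar

    def forward(self, x):
        return self.base(x) + torch.tanh(self.gate) * self.up(self.down(x))

layer = GatedLoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(2, 256)).shape)  # torch.Size([2, 256])
```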

Authors:Claudia Takyi Ankomah, Livingstone Eli Ayivor, Ireneaus Nyame, Leslie Wambo, Patrick Yeboah Bonsu, Aondona Moses Iorumbur, Raymond Confidence, Toufiq Musah
Title: How We Won BraTS-SSA 2025: Brain Tumor Segmentation in the Sub-Saharan African Population Using Segmentation-Aware Data Augmentation and Model Ensembling
Abstract:
Brain tumors, particularly gliomas, pose significant challenges due to their complex growth patterns, infiltrative nature, and the variability in brain structure across individuals, which makes accurate diagnosis and monitoring difficult. Deep learning models have been developed to accurately delineate these tumors. However, most of these models were trained on relatively homogeneous high-resource datasets, limiting their robustness when deployed in underserved regions. In this study, we performed segmentation-aware offline data augmentation on the BraTS-Africa dataset to increase the data sample size and diversity and thereby enhance generalization. We further constructed an ensemble of three distinct architectures, MedNeXt, SegMamba, and Residual-Encoder U-Net, to leverage their complementary strengths. Our best-performing model, MedNeXt, was trained for 1,000 epochs and achieved the highest average lesion-wise Dice and normalized surface distance scores of 0.86 and 0.81, respectively. However, the ensemble model trained for 500 epochs produced the most balanced segmentation performance across the tumor subregions. This work demonstrates that a combination of advanced augmentation and model ensembling can improve segmentation accuracy and robustness on diverse and underrepresented datasets. Code available at: https://github.com/SPARK-Academy-2025/SPARK-2025/tree/main/SPARK2025_BraTs_MODELS/SPARK_NeuroAshanti
中文: 本研究通过先进的数据增强和MedNeXt、SegMamba及残差编码器U-Net的模型集成,提升了BraTS-Africa数据集上脑肿瘤分割的准确性,增强了针对代表性不足人群的泛化能力。
English: This study enhances brain tumor segmentation on the BraTS-Africa dataset using advanced data augmentation and an ensemble of MedNeXt, SegMamba, and Residual-Encoder U-Net models, achieving improved accuracy and robustness for underrepresented populations.
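
The ensembling step reduces to averaging per-voxel class probabilities across the three members before taking the argmax; a toy sketch with equal weights (an assumption) is shown below.

```python
# Minimal model-ensembling sketch: average softmax outputs, then argmax.
import numpy as np

def ensemble_segmentation(prob_maps):
    # prob_maps: list of (C, D, H, W) softmax outputs from MedNeXt, SegMamba,
    # and the Residual-Encoder U-Net
    return np.mean(prob_maps, axis=0).argmax(axis=0)   # (D, H, W) label map

members = [np.random.dirichlet(np.ones(4), size=(8, 8, 8)).transpose(3, 0, 1, 2)
           for _ in range(3)]
print(ensemble_segmentation(members).shape)  # (8, 8, 8)
```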

Authors:Sixten Norelius, Aaron O. Feldman, Mac Schwager
Title: SketchPlan: Diffusion Based Drone Planning From Human Sketches
Abstract:
We propose SketchPlan, a diffusion-based planner that interprets 2D hand-drawn sketches over depth images to generate 3D flight paths for drone navigation. SketchPlan comprises two components: a SketchAdapter that learns to map the human sketches to projected 2D paths, and DiffPath, a diffusion model that infers 3D trajectories from 2D projections and a first person view depth image. Our model achieves zero-shot sim-to-real transfer, generating accurate and safe flight paths in previously unseen real-world environments. To train the model, we build a synthetic dataset of 32k flight paths using a diverse set of photorealistic 3D Gaussian Splatting scenes. We automatically label the data by computing 2D projections of the 3D flight paths onto the camera plane, and use this to train the DiffPath diffusion model. However, since real human 2D sketches differ significantly from ideal 2D projections, we additionally label 872 of the 3D flight paths with real human sketches and use this to train the SketchAdapter to infer the 2D projection from the human sketch. We demonstrate SketchPlan's effectiveness in both simulated and real-world experiments, and show through ablations that training on a mix of human labeled and auto-labeled data together with a modular design significantly boosts its capabilities to correctly interpret human intent and infer 3D paths. In real-world drone tests, SketchPlan achieved 100\% success in low/medium clutter and 40\% in unseen high-clutter environments, outperforming key ablations by 20-60\% in task completion.
中文: SketchPlan是一种基于扩散模型的系统,能将手绘二维草图结合深度图像生成无人机三维飞行路径,通过合成数据与人工标注的混合训练实现了零样本的仿真到现实迁移,在真实环境中展现出卓越的导航能力。
English: SketchPlan is a diffusion-based system that converts 2D hand-drawn sketches into 3D drone flight paths using depth images, achieving successful zero-shot transfer to real-world navigation through a combination of synthetic data and human-annotated training.

Authors:Lyes Saad Saoud, Loic Lesobre, Enrico Sorato, Irfan Hussain
Title: Real-Time Threaded Houbara Detection and Segmentation for Wildlife Conservation using Mobile Platforms
Abstract:
Real-time animal detection and segmentation in natural environments are vital for wildlife conservation, enabling non-invasive monitoring through remote camera streams. However, these tasks remain challenging due to limited computational resources and the cryptic appearance of many species. We propose a mobile-optimized two-stage deep learning framework that integrates a Threading Detection Model (TDM) to parallelize YOLOv10-based detection and MobileSAM-based segmentation. Unlike prior YOLO+SAM pipelines, our approach improves real-time performance by reducing latency through threading. YOLOv10 handles detection while MobileSAM performs lightweight segmentation, both executed concurrently for efficient resource use. On the cryptic Houbara Bustard, a conservation-priority species, our model achieves mAP50 of 0.9627, mAP75 of 0.7731, mAP95 of 0.7178, and a MobileSAM mIoU of 0.7421. YOLOv10 operates at 43.7 ms per frame, confirming real-time readiness. We introduce a curated Houbara dataset of 40,000 annotated images to support model training and evaluation across diverse conditions. The code and dataset used in this study are publicly available on GitHub at https://github.com/LyesSaadSaoud/mobile-houbara-detseg. For interactive demos and additional resources, visit https://lyessaadsaoud.github.io/LyesSaadSaoud-Threaded-YOLO-SAM-Houbara.
中文: 本研究提出了一种移动优化的两阶段深度学习框架,结合YOLOv10进行检测和MobileSAM进行分割,通过线程化技术提升实时性能,在隐蔽的波斑鸨检测与分割中实现了高精度。
English: This study introduces a mobile-optimized two-stage deep learning framework that combines YOLOv10 for detection and MobileSAM for segmentation, using threading to enhance real-time performance and achieving high accuracy in detecting and segmenting the cryptic Houbara Bustard.
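
A schematic of the threading idea: segmentation of the previous frame overlaps with detection of the current one, which is where the latency reduction comes from. detect() and segment() are stubs standing in for YOLOv10 and MobileSAM calls.

```python
# Threaded two-stage pipeline sketch: detection on frame t runs concurrently
# with segmentation on frame t-1.
from concurrent.futures import ThreadPoolExecutor
import time

def detect(frame):               # stand-in for YOLOv10 inference
    time.sleep(0.04); return [(10, 10, 80, 80)]   # dummy boxes

def segment(frame, boxes):       # stand-in for MobileSAM prompted by boxes
    time.sleep(0.03); return list(boxes)          # dummy masks

def run_pipeline(frames):
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=2) as pool:
        for frame in frames:
            det_future = pool.submit(detect, frame)
            if pending is not None:               # segmentation of previous frame
                results.append(pending.result())  # overlaps with current detection
            boxes = det_future.result()
            pending = pool.submit(segment, frame, boxes)
        results.append(pending.result())
    return results

print(len(run_pipeline(range(5))))  # 5 per-frame mask lists
```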

Authors:Renrong Shao, Wei Zhang, Jun Wang
Title: Conditional Pseudo-Supervised Contrast for Data-Free Knowledge Distillation
Abstract:
Data-free knowledge distillation~(DFKD) is an effective approach to model compression under data-transmission restrictions while preserving privacy, and it has attracted extensive attention in recent years. Currently, the majority of existing methods utilize a generator to synthesize images to support the distillation. Although these methods have achieved great success, there are still many issues to be explored. First, the outstanding performance of supervised learning in deep learning drives us to explore a pseudo-supervised paradigm for DFKD. Second, current synthesis methods cannot distinguish the distributions of different categories of samples, thus producing ambiguous samples that may be evaluated incorrectly by the teacher. Besides, current methods cannot optimize category-wise sample diversity, which hinders the student model from learning from diverse samples and achieving better performance. In this paper, to address the above limitations, we propose a novel learning paradigm, i.e., conditional pseudo-supervised contrast for data-free knowledge distillation~(CPSC-DFKD). The primary innovations of CPSC-DFKD are: (1) introducing a conditional generative adversarial network to synthesize category-specific diverse images for pseudo-supervised learning, (2) improving the modules of the generator to distinguish the distributions of different categories, and (3) proposing pseudo-supervised contrastive learning based on teacher and student views to enhance diversity. Comprehensive experiments on three commonly-used datasets validate the performance lift of both the student and the generator brought by CPSC-DFKD. The code is available at https://github.com/RoryShao/CPSC-DFKD.git
中文: 本文提出CPSC-DFKD这一新型无数据知识蒸馏方法,通过条件生成对抗网络合成特定类别图像,并采用伪监督对比学习增强样本多样性和分布区分能力,在三个数据集上的实验验证了其性能提升。
English: This paper introduces CPSC-DFKD, a novel data-free knowledge distillation method that uses conditional GANs to generate category-specific images and employs pseudo-supervised contrastive learning to enhance sample diversity and distribution distinction, validated by improved performance on three datasets.
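
The pseudo-supervised contrastive component can be sketched as an InfoNCE-style objective over teacher and student views of the same synthetic batch; the temperature and pairing scheme are assumptions.

```python
# Pseudo-supervised contrast sketch: teacher and student embeddings of the same
# synthetic image form a positive pair; other images in the batch are negatives.
import torch
import torch.nn.functional as F

def pseudo_supervised_contrast(teacher_emb, student_emb, tau=0.1):
    t = F.normalize(teacher_emb, dim=-1)       # (B, d) teacher view
    s = F.normalize(student_emb, dim=-1)       # (B, d) student view
    logits = s @ t.t() / tau                   # (B, B) similarity matrix
    targets = torch.arange(len(s))             # diagonal pairs are positives
    return F.cross_entropy(logits, targets)

loss = pseudo_supervised_contrast(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss) > 0)  # True
```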

Authors:Zhe Zhang, Mingxiu Cai, Gaochang Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu
Title: Unified Unsupervised Anomaly Detection via Matching Cost Filtering
Abstract:
Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
中文: 本文提出统一成本过滤(UCF)框架,通过多层注意力引导过滤异常成本体积来减少无监督异常检测中的匹配噪声,在多种单模态和多模态基准测试中均取得了最先进的性能。
English: This paper introduces Unified Cost Filtering (UCF), a post-hoc refinement framework that mitigates matching noise in unsupervised anomaly detection by filtering anomaly cost volumes with multi-layer attention guidance, achieving state-of-the-art results across diverse unimodal and multimodal benchmarks.
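
A toy construction of the matching cost volume that UCF refines, using cosine similarity between test-image patch features and those of normal references; the learnable attention-guided filtering module is omitted here.

```python
# Each test patch feature is matched against patch features from K normal
# reference images; 1 - max cosine similarity gives a raw per-patch anomaly map.
import torch
import torch.nn.functional as F

def anomaly_cost_volume(test_feats, ref_feats):
    # test_feats: (N, d) patch features of the test image (N = h*w)
    # ref_feats:  (K, M, d) patch features of K normal references
    t = F.normalize(test_feats, dim=-1)
    r = F.normalize(ref_feats, dim=-1)
    sim = torch.einsum("nd,kmd->knm", t, r)   # (K, N, M) similarities
    cost = 1.0 - sim                          # matching cost volume
    # Raw anomaly score: cheapest match across all reference patches.
    return cost.amin(dim=(0, 2))              # (N,) per-patch anomaly scores

scores = anomaly_cost_volume(torch.randn(196, 64), torch.randn(4, 196, 64))
print(scores.shape)  # torch.Size([196])
```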

Authors:Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland
Title: Inference-Time Search using Side Information for Diffusion-based Image Reconstruction
Abstract:
Diffusion models have emerged as powerful priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel inference-time search algorithm that guides the sampling process using the side information in a manner that balances exploration and exploitation. This enables more accurate and reliable reconstructions, providing an alternative to the gradient-based guidance that is prone to reward-hacking artifacts. Our approach can be seamlessly integrated into a wide range of existing diffusion-based image reconstruction pipelines. Through extensive experiments on a number of inverse problems, such as box inpainting, super-resolution, and various deblurring tasks including motion, Gaussian, nonlinear, and blind deblurring, we show that our approach consistently improves the qualitative and quantitative performance of diffusion-based image reconstruction algorithms. We also show the superior performance of our approach with respect to other baselines, including reward gradient-based guidance algorithms. The code is available at \href{https://github.com/mhdfb/sideinfo-search-reconstruction}{this repository}.
中文: 本文提出了一种新颖的推理时搜索算法,利用辅助信息增强扩散模型解决逆问题的能力,在修复和去模糊等任务中实现了更优的重建质量,同时避免了基于梯度方法常见的伪影问题。
English: This paper introduces a novel inference-time search algorithm that leverages side information to enhance diffusion models for solving inverse problems, achieving superior reconstruction quality across tasks like inpainting and deblurring without the artifacts common in gradient-based methods.
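
A generic sketch of inference-time search with side information, framed as particle resampling over sampling trajectories. denoise_step() and score_fn() are placeholders, and the softmax-resampling rule is one simple way to trade exploration against exploitation, not necessarily the paper's exact algorithm.

```python
# Maintain a small population of reverse-diffusion trajectories, score the
# intermediate estimates against the side information, and resample with a
# temperature-controlled softmax (lower temperature = more exploitation).
import torch

def search_sampling(x_init, n_steps, score_fn, denoise_step, n_particles=8,
                    temperature=0.5):
    xs = x_init.repeat(n_particles, *([1] * (x_init.dim() - 1)))
    for t in reversed(range(n_steps)):
        xs = denoise_step(xs, t)                       # one reverse step
        scores = score_fn(xs)                          # agreement with side info
        probs = torch.softmax(scores / temperature, dim=0)
        idx = torch.multinomial(probs, n_particles, replacement=True)
        xs = xs[idx]                                   # resample trajectories
    return xs[score_fn(xs).argmax()]                   # best final reconstruction

# Toy run: "denoising" shrinks noise; side info prefers images close to zero.
out = search_sampling(torch.randn(1, 3, 8, 8), n_steps=10,
                      score_fn=lambda x: -x.flatten(1).norm(dim=1),
                      denoise_step=lambda x, t: 0.9 * x)
print(out.shape)  # torch.Size([3, 8, 8])
```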

Authors:Rong Liu, Zhongpai Gao, Benjamin Planche, Meida Chen, Van Nguyen Nguyen, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Yue Wang, Andrew Feng, Ziyan Wu
Title: Universal Beta Splatting
Abstract:
We introduce Universal Beta Splatting (UBS), a unified framework that generalizes 3D Gaussian Splatting to N-dimensional anisotropic Beta kernels for explicit radiance field rendering. Unlike fixed Gaussian primitives, Beta kernels enable controllable dependency modeling across spatial, angular, and temporal dimensions within a single representation. Our unified approach captures complex light transport effects, handles anisotropic view-dependent appearance, and models scene dynamics without requiring auxiliary networks or specific color encodings. UBS maintains backward compatibility by reducing to Gaussian Splatting as a special case, guaranteeing plug-in usability and a lower performance bound. The learned Beta parameters naturally decompose scene properties into interpretable components without explicit supervision: spatial (surface vs. texture), angular (diffuse vs. specular), and temporal (static vs. dynamic). Our CUDA-accelerated implementation achieves real-time rendering while consistently outperforming existing methods across static, view-dependent, and dynamic benchmarks, establishing Beta kernels as a scalable universal primitive for radiance field rendering. Our project website is available at https://rongliu-leo.github.io/universal-beta-splatting/.
中文: 通用Beta飞溅(UBS)提出了一种统一框架,利用N维Beta核建模空间、角度和时间依赖关系,在静态、视角相关和动态场景中均优于现有方法,同时保持实时渲染性能。
English: Universal Beta Splatting (UBS) introduces a unified framework using N-dimensional Beta kernels to model spatial, angular, and temporal dependencies for explicit radiance field rendering, outperforming existing methods in static, view-dependent, and dynamic scenarios while maintaining real-time performance.
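
A small numerical illustration of why Beta kernels can subsume Gaussians: a compactly supported kernel (1 - r^2/b)^b approaches exp(-r^2) as the shape parameter b grows, so a Gaussian is recovered in the limit while small b gives sharper, bounded-support primitives. The parameterization is illustrative, not UBS's exact kernel.

```python
# (1 - r^2/b)^b -> exp(-r^2) as b -> infinity; small b = compact, sharp kernel.
import numpy as np

def beta_kernel(r, b):
    z = np.clip(1.0 - r**2 / b, 0.0, None)   # compact support: r^2 <= b
    return z**b

r = np.linspace(0.0, 2.0, 5)
for b in (1, 4, 64):
    print(f"b={b:3d}", np.round(beta_kernel(r, b), 4))
print("gauss", np.round(np.exp(-r**2), 4))   # the Gaussian limit
```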

Authors:Akshar Gothi
Title: Convolutional Neural Nets vs Vision Transformers: A SpaceNet Case Study with Balanced vs Imbalanced Regimes
Abstract:
We present a controlled comparison of a convolutional neural network (EfficientNet-B0) and a Vision Transformer (ViT-Base) on SpaceNet under two label-distribution regimes: a naturally imbalanced five-class split and a balanced-resampled split with 700 images per class (70:20:10 train/val/test). With matched preprocessing (224x224, ImageNet normalization), lightweight augmentations, and a 40-epoch budget on a single NVIDIA P100, we report accuracy, macro-F1, balanced accuracy, per-class recall, and deployment metrics (model size and latency). On the imbalanced split, EfficientNet-B0 reaches 93% test accuracy with strong macro-F1 and lower latency; ViT-Base is competitive at 93% with a larger parameter count and runtime. On the balanced split, both models are strong; EfficientNet-B0 reaches 99% while ViT-Base remains competitive, indicating that balancing narrows architecture gaps while CNNs retain an efficiency edge. We release manifests, logs, and per-image predictions to support reproducibility.
中文: 本研究对比了EfficientNet-B0与Vision Transformer在SpaceNet数据集上的表现,证明CNN在处理不平衡数据时效率更优,在平衡数据下两者性能相当,同时公开了所有实验材料以确保可复现性。
English: This study compares EfficientNet-B0 and Vision Transformer on SpaceNet, showing CNN's superior efficiency on imbalanced data and competitive performance with balanced resampling, while releasing all materials for reproducibility.

Authors:Junhao Xia, Ming Zhao, Limin Xiao, Xiujun Zhang
Title: SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size
Abstract:
Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuously adjustable Over-Sampling Ratio (OSR), enabling dynamic adaptation to memory or VRAM constraints by selecting a fractional OSR (e.g., 2.5x) for an optimal trade-off between model size and accuracy. SDQ-LLM uses upsampling combined with a Sigma-Delta quantizer to binarize or ternarize LLM weights, encoding high-precision parameters into 1-bit or 1.58-bit representations and replacing the multiplication operations within linear layers with additions. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we incorporate Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully leverage the continuity of the OSR and reduce precision loss, and recognizing the correlation between quantization sensitivity and weight variance, we propose MultiOSR, a fine-grained, layer- and linear-wise OSR allocation strategy that distributes the OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on the OPT and LLaMA model families demonstrate that SDQ-LLM achieves efficient, high-precision performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.
中文: SDQ-LLM提出创新的Sigma-Delta量化框架,通过可调过采样比实现1比特大语言模型,在哈达玛平滑和MultiOSR分配策略支持下,用加法替代乘法运算并保持语言推理能力。
English: SDQ-LLM introduces a novel Sigma-Delta quantization framework enabling 1-bit LLMs with adjustable over-sampling ratios, replacing multiplications with additions while maintaining reasoning capabilities through Hadamard smoothing and MultiOSR allocation.
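
A toy first-order Sigma-Delta quantizer over one weight row, with fractional OSR handled by linear interpolation; these details are assumptions meant only to convey the oversample-then-binarize-with-error-feedback idea.

```python
# Oversample the weight sequence by OSR, then binarize with error feedback so
# the running quantization error is shaped away.
import numpy as np

def sigma_delta_binarize(w, osr=2.5):
    n_up = int(round(len(w) * osr))
    x = np.interp(np.linspace(0, len(w) - 1, n_up), np.arange(len(w)), w)
    q = np.empty(n_up)
    err = 0.0
    for i, xi in enumerate(x):
        v = xi + err                 # add accumulated quantization error
        q[i] = 1.0 if v >= 0 else -1.0
        err = v - q[i]               # feed the new error forward
    return q                         # 1-bit codes, length len(w) * osr

w = np.random.randn(8) * 0.5
q = sigma_delta_binarize(w, osr=2.5)
print(q[:10], q.shape)               # +/-1 codes, 20 values for 8 weights
```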

Authors:Talha Ahmed, Nehal Ahmed Shaikh, Hassan Mohy-ud-Din
Title: Wave-GMS: Lightweight Multi-Scale Generative Model for Medical Image Segmentation
Abstract:
For equitable deployment of AI tools in hospitals and healthcare facilities, we need Deep Segmentation Networks that offer high performance and can be trained on cost-effective GPUs with limited memory and large batch sizes. In this work, we propose Wave-GMS, a lightweight and efficient multi-scale generative model for medical image segmentation. Wave-GMS has a substantially smaller number of trainable parameters, does not require loading memory-intensive pretrained vision foundation models, and supports training with large batch sizes on GPUs with limited memory. We conducted extensive experiments on four publicly available datasets (BUS, BUSI, Kvasir-Instrument, and HAM10000), demonstrating that Wave-GMS achieves state-of-the-art segmentation performance with superior cross-domain generalizability, while requiring only ~2.6M trainable parameters. Code is available at https://github.com/ATPLab-LUMS/Wave-GMS.
中文: 为实现医疗领域人工智能工具的公平部署,本研究提出Wave-GMS轻量级生成模型,该模型通过极少的可训练参数在有限GPU内存下实现最优医学图像分割性能,并展现出卓越的跨领域泛化能力。
English: To enable equitable AI deployment in healthcare, this study introduces Wave-GMS, a lightweight generative model for medical image segmentation that achieves state-of-the-art performance with minimal parameters and efficient GPU usage.

Authors:Jiapeng Tang, Matthew Lavine, Dor Verbin, Stephan J. Garbin, Matthias Nießner, Ricardo Martin Brualla, Pratul P. Srinivasan, Philipp Henzler
Title: ROGR: Relightable 3D Objects using Generative Relighting
Abstract:
We introduce ROGR, a novel approach that reconstructs a relightable 3D model of an object captured from multiple views, driven by a generative relighting model that simulates the effects of placing the object under novel environment illuminations. Our method samples the appearance of the object under multiple lighting environments, creating a dataset that is used to train a lighting-conditioned Neural Radiance Field (NeRF) that outputs the object's appearance under any input environmental lighting. The lighting-conditioned NeRF uses a novel dual-branch architecture to encode the general lighting effects and specularities separately. The optimized lighting-conditioned NeRF enables efficient feed-forward relighting under arbitrary environment maps without requiring per-illumination optimization or light transport simulation. We evaluate our approach on the established TensoIR and Stanford-ORB datasets, where it improves upon the state-of-the-art on most metrics, and showcase our approach on real-world object captures.

Authors:Tianyu Xu, Jiawei Chen, Jiazhao Zhang, Wenyao Zhang, Zekun Qi, Minghan Li, Zhizheng Zhang, He Wang
Title: MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning
Abstract:
Visual navigation policy is widely regarded as a promising direction, as it mimics humans by using egocentric visual observations for navigation. However, the optical information in visual observations is difficult to model explicitly, unlike LiDAR point clouds or depth maps, which in turn demands intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three challenging tailor-made environments for different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from the RL experts, where the training ratio is dynamically balanced based on performance on the individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.
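
The capability-balanced data mixing can be sketched as a softmax over per-capability deficits, so the weakest skill receives the largest share of the next batch; the temperature is an assumption.

```python
# Give more training weight to the capabilities (reaching / squeezing /
# avoiding) on which the student currently performs worst.
import numpy as np

def mixing_ratios(success_rates: dict, temperature: float = 0.2) -> dict:
    names = list(success_rates)
    deficits = np.array([1.0 - success_rates[n] for n in names])
    w = np.exp(deficits / temperature)
    w /= w.sum()
    return dict(zip(names, w))

print(mixing_ratios({"reaching": 0.9, "squeezing": 0.6, "avoiding": 0.75}))
# -> the largest share of the next batch comes from the 'squeezing' expert
```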

Authors:Gen Li, Bo Zhao, Jianfei Yang, Laura Sevilla-Lara
Title: Mask2IV: Interaction-Centric Video Generation via Mask Trajectories
Abstract:
Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

Authors:Beibei Lin, Tingting Chen, Robby T. Tan
Title: GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion
Abstract:
Reference-driven image completion, which restores missing regions in a target view using additional images, is particularly challenging when the target view differs significantly from the references. Existing generative methods rely solely on diffusion priors and, without geometric cues such as camera pose or depth, often produce misaligned or implausible content. We propose GeoComplete, a novel framework that incorporates explicit 3D structural guidance to enforce geometric consistency in the completed regions, setting it apart from prior image-only approaches. GeoComplete introduces two key ideas: conditioning the diffusion process on projected point clouds to infuse geometric information, and applying target-aware masking to guide the model toward relevant reference cues. The framework features a dual-branch diffusion architecture. One branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches ensures coherent and accurate completion. To address regions visible in references but absent in the target, we project the target view into each reference to detect occluded areas, which are then masked during training. This target-aware masking directs the model to focus on useful cues, enhancing performance in difficult scenarios. By integrating a geometry-aware dual-branch diffusion architecture with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

Authors:Xiaoyan Kui, Qianmu Xiao, Qinsong Li, Zexin Ji, Jielin Zhang, Beiji Zou
Title: Flip Distribution Alignment VAE for Multi-Phase MRI Synthesis
Abstract:
Separating shared and independent features is crucial for multi-phase contrast-enhanced (CE) MRI synthesis. However, existing methods use deep autoencoder generators with low parameter efficiency and lack interpretable training strategies. In this paper, we propose the Flip Distribution Alignment Variational Autoencoder (FDA-VAE), a lightweight feature-decoupled VAE model for multi-phase CE MRI synthesis. Our method encodes input and target images into two latent distributions that are symmetric with respect to a standard normal distribution, effectively separating shared and independent features. A Y-shaped bidirectional training strategy further enhances the interpretability of the feature separation. Experimental results show that, compared to existing deep autoencoder-based end-to-end synthesis methods, FDA-VAE significantly reduces model parameters and inference time while effectively improving synthesis quality. The source code is publicly available at https://github.com/QianMuXiao/FDA-VAE.
中文: 本文提出FDA-VAE轻量化变分自编码器,通过对称隐分布和双向训练策略有效分离多期增强MRI的共享与独立特征,在显著减少模型参数的同时提升了合成质量。
English: This paper introduces FDA-VAE, a lightweight variational autoencoder that efficiently separates shared and independent features for multi-phase contrast-enhanced MRI synthesis through symmetric latent distributions and bidirectional training, significantly reducing parameters and improving synthesis quality.
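
A loose sketch of the flip-alignment idea, assuming the two encoders produce Gaussians whose means mirror each other about N(0, I); the loss weights and exact alignment objective are assumptions for illustration.

```python
# Two KL terms keep both latents near the standard normal, while a flip term
# encourages the target mean to be the mirror image of the input mean.
import torch

def kl_std_normal(mu, logvar):
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()

def fda_losses(mu_in, logvar_in, mu_tg, logvar_tg):
    kl = kl_std_normal(mu_in, logvar_in) + kl_std_normal(mu_tg, logvar_tg)
    flip = (mu_in + mu_tg).pow(2).sum(dim=1).mean()   # mu_tg ~ -mu_in
    return kl, flip

mu_a, lv_a = torch.randn(4, 16), torch.zeros(4, 16)
kl, flip = fda_losses(mu_a, lv_a, -mu_a + 0.01, lv_a)
print(float(kl) >= 0, float(flip) < 1e-2)  # True True
```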

Authors:Jakub Lisowski, Piotr Tyrakowski, Szymon Zyguła, Krzysztof Kaczmarski
Title: PyRadiomics-cuda: a GPU-accelerated 3D features extraction from medical images within PyRadiomics
Abstract:
PyRadiomics-cuda is a GPU-accelerated extension of the PyRadiomics library, designed to address the computational challenges of extracting three-dimensional shape features from medical images. By offloading key geometric computations to GPU hardware, it dramatically reduces processing times for large volumetric datasets. The system maintains full compatibility with the original PyRadiomics API, enabling seamless integration into existing AI workflows without code modifications. This transparent acceleration facilitates efficient, scalable radiomics analysis, supporting the rapid feature extraction essential for high-throughput AI pipelines. Tests performed on a typical computational cluster as well as on budget and home devices demonstrate its usefulness in all of these scenarios. PyRadiomics-cuda is implemented in Python and C/CUDA and is freely available under the BSD license at https://github.com/mis-wut/pyradiomics-CUDA. Additionally, the PyRadiomics-cuda test suite is available at https://github.com/mis-wut/pyradiomics-cuda-data-gen; it provides a detailed handbook and sample scripts suited to different kinds of workflows, plus detailed installation instructions. The dataset used for testing is available at Kaggle: https://www.kaggle.com/datasets/sabahesaraki/kidney-tumor-segmentation-challengekits-19
中文: PyRadiomics-cuda是一个基于GPU加速的扩展工具,能大幅提升医学图像三维特征提取效率,同时完全兼容原PyRadiomics接口,实现与现有AI工作流程的无缝集成。
English: PyRadiomics-cuda is a GPU-accelerated extension that significantly speeds up 3D medical image feature extraction while maintaining full compatibility with the original PyRadiomics API for seamless AI workflow integration.
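
Because the API is unchanged, a standard PyRadiomics extraction script should run as-is once the accelerated package is installed; the file paths below are placeholders.

```python
# Standard PyRadiomics usage; with PyRadiomics-cuda installed, the 3D shape
# feature computations run on the GPU transparently.
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("shape")   # 3D shape features: the GPU hot spot
result = extractor.execute("case_000_image.nii.gz", "case_000_mask.nii.gz")

for name, value in result.items():
    if name.startswith("original_shape"):
        print(name, value)
```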

Authors:Md Zahim Hassan, Md. Osama, Muhammad Ashad Kabir, Md. Saiful Islam, Zannatul Naim
Title: ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment
Abstract:
Accurate, non-destructive assessment of egg quality is critical for ensuring food safety, maintaining product standards, and operational efficiency in commercial poultry production. This paper introduces ELMF4EggQ, an ensemble learning framework that employs multimodal feature fusion to classify egg grade and freshness using only external attributes - image, shape, and weight. A novel, publicly available dataset of 186 brown-shelled eggs was constructed, with egg grade and freshness levels determined through laboratory-based expert assessments involving internal quality measurements, such as yolk index and Haugh unit. To the best of our knowledge, this is the first study to apply machine learning methods for internal egg quality assessment using only external, non-invasive features, and the first to release a corresponding labeled dataset. The proposed framework integrates deep features extracted from external egg images with structural characteristics such as egg shape and weight, enabling a comprehensive representation of each egg. Image feature extraction is performed using top-performing pre-trained CNN models (ResNet152, DenseNet169, and ResNet152V2), followed by PCA-based dimensionality reduction, SMOTE augmentation, and classification using multiple machine learning algorithms. An ensemble voting mechanism combines predictions from the best-performing classifiers to enhance overall accuracy. Experimental results demonstrate that the multimodal approach significantly outperforms image-only and tabular (shape and weight) only baselines, with the multimodal ensemble approach achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction. All code and data are publicly available at https://github.com/Kenshin-Keeps/Egg_Quality_Prediction_ELMF4EggQ, promoting transparency, reproducibility, and further research in this domain.
中文: 本文提出ELMF4EggQ集成学习框架,通过融合图像、形状和重量等外部特征,首次实现了基于非侵入式方法的鸡蛋等级与新鲜度分类,并发布了首个相关公开数据集,显著提升了检测精度。
English: This paper presents ELMF4EggQ, an ensemble learning framework that uses multimodal feature fusion of external attributes like image, shape, and weight to non-invasively classify egg grade and freshness, achieving high accuracy and releasing the first public dataset for such assessments.
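
A condensed sketch of the tabular end of the pipeline (PCA on deep image features, SMOTE oversampling, soft-voting ensemble); the feature arrays are random stand-ins, and the component counts and member classifiers are assumptions.

```python
# PCA-reduced CNN features fused with shape/weight descriptors, SMOTE-balanced,
# then classified by a soft-voting ensemble.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
cnn_feats = rng.normal(size=(186, 2048))          # e.g., pooled ResNet152 features
shape_weight = rng.normal(size=(186, 4))          # egg shape descriptors + weight
y = rng.integers(0, 3, size=186)                  # grade labels

X = np.hstack([PCA(n_components=32).fit_transform(cnn_feats), shape_weight])
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True)),
                ("rf", RandomForestClassifier())],
    voting="soft")
ensemble.fit(X_res, y_res)
print(ensemble.predict(X[:5]))
```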

Authors:Jingyuan Deng, Yujiu Yang
Title: MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding
Abstract:
Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. As their capabilities improve, however, problems emerge as well. Among them, hallucination, the phenomenon in which LVLMs generate content that contradicts their input visual and textual contents, has attracted much attention. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle to construct appropriate contrastive samples, and attention manipulation methods are highly sensitive and lack stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates hallucinations and retains the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD.
中文: 本文提出MaskCD方法,通过掩码视觉语言模型中的图像头构建对比样本,在缓解模型幻觉现象的同时保持其通用能力,多基准测试验证了该方法的有效性。
English: This paper introduces MaskCD, a novel method that mitigates hallucinations in large vision-language models by masking image heads to create contrastive samples, effectively reducing contradictions while preserving model capabilities as validated on multiple benchmarks.
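
The decoding rule itself is compact: contrast the full model's logits with those of the same model whose image heads are masked. get_logits() and the contrast strength alpha are placeholders for the real model calls.

```python
# Contrastive decoding with an image-head-masked "amateur" branch.
import torch

def maskcd_next_token(get_logits, tokens, alpha=1.0):
    logits_full = get_logits(tokens, mask_image_heads=False)
    logits_masked = get_logits(tokens, mask_image_heads=True)
    # Penalize tokens that the vision-blind branch also prefers: those are the
    # hallucination-prone continuations.
    contrast = (1 + alpha) * logits_full - alpha * logits_masked
    return int(contrast.argmax())

# Toy check over a 5-token vocabulary: the full model slightly prefers token 1,
# but the masked (vision-blind) branch likes it too, so token 4 wins instead.
fake = lambda toks, mask_image_heads: (
    torch.tensor([0.1, 2.4, 0.3, 0.0, 0.0]) if mask_image_heads
    else torch.tensor([0.2, 2.3, 0.3, 0.0, 2.1]))
print(maskcd_next_token(fake, tokens=[]))  # 4
```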

Authors:Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye
Title: Align Your Query: Representation Alignment for Multimodality Medical Object Detection
Abstract:
Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.
中文: 本研究提出一种通过多模态上下文注意力和QueryREPA预训练框架,将目标查询与模态上下文对齐,从而以最小开销提升混合医学影像模态下的目标检测性能。
English: This study introduces a framework using Multimodality Context Attention and QueryREPA pretraining to align object queries with modality context, enhancing medical object detection across mixed imaging modalities with minimal overhead.

Authors:Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun
Title: AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Abstract:
Understanding long-form videos remains a significant challenge for vision--language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance--Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.
中文: AdaRD-Key是一种无需训练的关键帧采样模块,通过最大化统一的相关性-多样性目标,并针对弱对齐查询自适应切换至纯多样性模式,有效解决了长视频理解中的关键帧选择难题,在保持实时效率的同时实现了最先进的性能。
English: AdaRD-Key is a training-free keyframe sampling module that addresses long-form video understanding challenges by maximizing a unified Relevance-Diversity objective and adaptively switching to diversity-only mode for weakly-aligned queries, achieving state-of-the-art performance with real-time efficiency.
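
A greedy sketch of a relevance plus log-determinant (max-volume) objective; the trade-off weight, Gram regularizer, and greedy search are assumptions standing in for the paper's exact RD-MV optimization, and the relevance-aware gate would simply zero the relevance term for weakly aligned queries.

```python
# At each step, pick the frame maximizing query relevance plus the log-det
# volume gain of the selected feature set (informative yet non-redundant).
import numpy as np

def rd_mv_select(feats, relevance, k=4, lam=1.0, eps=1e-3):
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(f)):
            if i in selected:
                continue
            S = f[selected + [i]]
            G = S @ S.T + eps * np.eye(len(S))       # regularized Gram matrix
            # Diversity-only gating would drop relevance[i] from this score.
            gain = relevance[i] + lam * np.linalg.slogdet(G)[1]
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

feats = np.random.default_rng(1).normal(size=(50, 32))   # frame features
rel = np.random.default_rng(2).uniform(size=50)          # query relevance
print(rd_mv_select(feats, rel))   # indices of informative, non-redundant frames
```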

Authors:Junyu Shi, Yong Sun, Zhiyuan Zhang, Lijiang Liu, Zhengjie Zhang, Yuxin He, Qiang Nie
Title: MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context
Abstract:
Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6\% on HumanML3D and 34.6\% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC
中文: 现有运动生成方法缺乏因果逻辑和视觉基础,因此MoGIC提出融合意图建模与视觉先验的统一框架,通过联合优化多模态条件生成与意图预测,在多指标上实现显著性能提升。
English: Existing motion generation methods lack causal logic and visual grounding, so MoGIC introduces a unified framework integrating intention modeling and visual priors to enhance multimodal motion synthesis, achieving significant performance improvements across benchmarks.

Authors:Beijia Lu, Ziyi Chen, Jing Xiao, Jun-Yan Zhu
Title: Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation
Abstract:
Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker's face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.

Authors:JoonHo Lee, Sunho Park
Title: Exploring OCR-augmented Generation for Bilingual VQA
Abstract:
We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.
中文: 本研究提出KLOCR双语OCR模型,通过1亿实例训练增强视觉语言模型在韩语和英语多语言任务中的能力,实验表明OCR提取文本显著提升各类模型在视觉问答任务中的表现。
English: This study introduces KLOCR, a bilingual OCR model trained on 100 million instances, to enhance Vision Language Models for multilingual tasks in Korean and English, demonstrating that OCR-extracted text significantly improves performance across various models and benchmarks.
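
The OCR-augmented recipe reduces to prepending recognized text to the VLM prompt; run_ocr() and ask_vlm() are placeholders for KLOCR and any VLM backend, and the prompt template is an assumption.

```python
# OCR-augmented VQA: feed the OCR transcript alongside the question.
def ocr_augmented_vqa(image, question, run_ocr, ask_vlm):
    ocr_text = run_ocr(image)  # e.g., KLOCR output for the page
    prompt = (f"OCR-extracted text:\n{ocr_text}\n\n"
              f"Using the image and the text above, answer: {question}")
    return ask_vlm(image, prompt)

# Toy run with stub functions.
answer = ocr_augmented_vqa(
    image=None, question="What is the title?",
    run_ocr=lambda img: "2025 Annual Report",
    ask_vlm=lambda img, p: "The title is '2025 Annual Report'.")
print(answer)
```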

Authors:Guy Ohayon, Pierre-Etienne H. Fiquet, Florentin Guth, Jona Ballé, Eero P. Simoncelli
Title: Learning a distance measure from the information-estimation geometry of data
Abstract:
We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the blurred density around the signals over a range of blur levels. We prove that the IEM is a valid global metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.
中文: 信息估计度量是一种基于概率密度的新型距离函数,通过去噪误差将信息论与估计理论联系起来,被证明是一种有效的全局度量,能适应分布几何结构,并在人类感知评估中优于现有图像质量指标。
English: The Information-Estimation Metric (IEM) is a novel distance function derived from probability densities that connects information theory with estimation theory through denoising errors, proving to be a valid global metric that adapts to distribution geometry and outperforms existing image quality metrics in human perceptual evaluations.
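
A schematic of the metric, assuming a closed-form Gaussian-optimal denoiser in place of a learned one and a uniform weighting over noise levels; both are simplifying assumptions.

```python
# Compare the denoiser's outputs for two signals across a range of noise
# amplitudes and integrate the squared differences.
import numpy as np

def gaussian_denoiser(x, sigma, var=1.0):
    # Optimal denoiser for x ~ N(0, var I) observed at noise level sigma:
    # E[x | y] = var / (var + sigma^2) * y
    return var / (var + sigma**2) * x

def iem_distance(x, y, sigmas=np.linspace(0.05, 3.0, 30)):
    total = 0.0
    for s in sigmas:
        dx = gaussian_denoiser(x, s)
        dy = gaussian_denoiser(y, s)
        total += np.sum((dx - dy) ** 2)
    return np.sqrt(total / len(sigmas))

a, b = np.array([0.5, -1.0]), np.array([0.4, -0.8])
print(round(iem_distance(a, b), 4))  # in this Gaussian toy case the metric
                                     # reduces to a Mahalanobis-like distance
```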

Authors:Sung-Yeon Park, Adam Lee, Juanwu Lu, Can Cui, Luyang Jiang, Rohit Gupta, Kyungtae Han, Ahmadreza Moradipari, Ziran Wang
Title: SIMSplat: Predictive Driving Scene Editing with Language-aligned 4D Gaussian Splatting
Abstract:
Driving scene manipulation with sensor data is emerging as a promising alternative to traditional virtual driving simulators. However, existing frameworks struggle to generate realistic scenarios efficiently due to limited editing capabilities. To address these challenges, we present SIMSplat, a predictive driving scene editor with language-aligned Gaussian splatting. As a language-controlled editor, SIMSplat enables intuitive manipulation using natural language prompts. By aligning language with Gaussian-reconstructed scenes, it further supports direct querying of road objects, allowing precise and flexible editing. Our method provides detailed object-level editing, including adding new objects and modifying the trajectories of both vehicles and pedestrians, while also incorporating predictive path refinement through multi-agent motion prediction to generate realistic interactions among all agents in the scene. Experiments on the Waymo dataset demonstrate SIMSplat's extensive editing capabilities and adaptability across a wide range of scenarios. Project page: https://sungyeonparkk.github.io/simsplat/

Authors:Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Title: Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity
Abstract:
Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
中文: 本研究提出了一个理论框架和两种与架构无关的算法,通过控制采样动态来提升文本到图像模型的多主体生成准确性,在保持基础模型特性的同时实现了最先进的性能表现。
English: This research introduces a theoretical framework and two architecture-agnostic algorithms that enhance multi-subject fidelity in text-to-image models by controlling sampling dynamics, achieving state-of-the-art performance while preserving base-model characteristics.
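
An abstract sketch of the test-time control view: perturb the flow-matching velocity with a single-pass control term, here the gradient of a stand-in reward; the Euler discretization, reward, and step sizes are assumptions.

```python
# Euler sampling with a controlled velocity: v_base + eta * grad(reward).
import torch

def controlled_euler_sampling(x, velocity_fn, reward_fn, n_steps=10, eta=0.5):
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor(i * dt)
        x = x.detach().requires_grad_(True)
        u = torch.autograd.grad(reward_fn(x).sum(), x)[0]   # single-pass control
        with torch.no_grad():
            x = x + dt * (velocity_fn(x, t) + eta * u)      # perturbed velocity
    return x.detach()

# Toy check: with a null base velocity, the reward gradient alone steers
# samples toward the origin.
out = controlled_euler_sampling(torch.randn(2, 4),
                                velocity_fn=lambda x, t: torch.zeros_like(x),
                                reward_fn=lambda x: -(x ** 2).sum(dim=1))
print(out.shape)  # torch.Size([2, 4])
```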

Authors:Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu
Title: StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
Abstract:
3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/
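
The density-guided step can be sketched with an off-the-shelf KDE: fit it over the existing Gaussian centers and keep the lowest-density candidate locations as injection sites. The bandwidth and candidate grid are assumptions.

```python
# Rank candidate 3D locations by estimated density of existing 3DGS centers;
# illusory points go where the reconstruction is least constrained.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
centers = rng.normal(size=(5000, 3))              # existing 3DGS point centers
kde = gaussian_kde(centers.T)

candidates = rng.uniform(-3, 3, size=(2000, 3))   # candidate injection sites
density = kde(candidates.T)
injection_sites = candidates[np.argsort(density)[:50]]   # 50 lowest-density spots
print(injection_sites.shape)  # (50, 3)
```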

Authors:Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
Title: VideoNSA: Native Sparse Attention Scales Video Understanding
Abstract:
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid approach to attention, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) learnable combined sparse attention helps induce dynamic attention sinks.
Summary: VideoNSA enhances video-language models by applying Native Sparse Attention to video, enabling scalable, coherent long-video understanding and improved performance on temporal and spatial benchmarks through optimized attention allocation.
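The routing idea, dense attention for text tokens and sparse attention for video tokens, can be sketched with a simple attention mask. Note that NSA proper combines compressed, selected, and sliding-window branches; the toy below only illustrates the per-modality routing, with `window` as an assumed locality parameter.

```python
import torch
import torch.nn.functional as F

def hybrid_attention(q, k, v, is_video, window=256):
    # q, k, v: (T, d); is_video: (T,) bool. Text queries attend densely;
    # video queries attend only within a local window (a stand-in for NSA).
    T = q.shape[0]
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    dense = torch.ones(T, T, dtype=torch.bool)
    mask = torch.where(is_video[:, None], local, dense)
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```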

Authors:Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi
Title: From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
Abstract:
We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.
Summary: VLM-Lens is a toolkit that facilitates systematic benchmarking and analysis of vision-language models by extracting intermediate outputs from any layer, offering a unified YAML-configurable interface for diverse models and interpretability methods.
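Layer-wise extraction of this kind is typically implemented with forward hooks driven by a config file. The sketch below assumes a PyTorch model and invented layer names; VLM-Lens's actual YAML schema is not specified in the abstract.

```python
import yaml

CONFIG = yaml.safe_load("""
layers:                        # invented layer names, for illustration only
  - vision_tower.encoder.layers.11
  - language_model.layers.23
""")

def register_extractors(model, layer_names, store):
    # Attach forward hooks that record each named submodule's output.
    modules = dict(model.named_modules())
    handles = []
    for name in layer_names:
        hook = lambda m, inp, out, name=name: store.setdefault(name, []).append(out)
        handles.append(modules[name].register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done

# Usage (model is any torch.nn.Module):
#   store = {}
#   handles = register_extractors(model, CONFIG["layers"], store)
```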

Authors:Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan
Title: microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
Abstract:
Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
Summary: microCLIP is a self-training framework that enhances CLIP's fine-grained classification by integrating saliency-guided local features with global representations and stabilizing adaptation through LLM-derived classifiers, achieving a 2.90% average accuracy gain across 13 benchmarks.
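The Dynamic Knowledge Aggregation step reduces to a convex combination of logits. A one-line sketch, with `alpha` as an assumed mixing weight rather than the paper's schedule:

```python
import torch

def aggregate_logits(prior_logits, evolving_logits, alpha=0.7):
    # Convex combination of a frozen LLM/CLIP prior with evolving logits.
    assert 0.0 <= alpha <= 1.0
    return alpha * prior_logits + (1.0 - alpha) * evolving_logits

pseudo_labels = aggregate_logits(torch.randn(8, 200), torch.randn(8, 200)).argmax(-1)
```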

Authors:Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, Matthew R. Walter
Title: Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning
Abstract:
We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at https://ripl.github.io/know_your_camera/.
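A Plücker embedding assigns each pixel the pair (d, o × d), where d is the unit ray direction and o the camera center, giving a 6-channel map the policy can consume. A minimal sketch under the convention x_cam = R x_world + t (an assumption; the paper's exact convention may differ):

```python
import torch

def plucker_map(K_inv, R, t, H, W):
    # Per-pixel Plücker embedding (d, o x d), rays in the world frame.
    # K_inv, R: (3, 3) float tensors; t: (3,) float tensor.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3)
    d = torch.einsum("ij,hwj->hwi", R.T @ K_inv, pix)                  # ray directions
    d = d / d.norm(dim=-1, keepdim=True)
    o = (-R.T @ t).expand_as(d)                                        # camera center
    m = torch.cross(o, d, dim=-1)                                      # moment o x d
    return torch.cat([d, m], dim=-1)                                   # (H, W, 6)
```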

Authors:Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen, Khoa D. Doan
Title: The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates this shrinkage by analyzing RLVR's learning dynamics and reveals two critical phenomena that explain the failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to a decline in Pass@$k$ performance, i.e., the probability of generating a correct solution within $k$ attempts. Second, we uncover a winner-take-all phenomenon: RLVR disproportionately reinforces problems whose correct solutions already have high likelihood under the base model, while suppressing initially low-likelihood ones. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@$k$ performance. Our code is available at https://github.com/mail-research/SELF-llm-interference.
Summary: RLVR can paradoxically limit reasoning by causing negative interference and a winner-take-all effect, but a data curation method focusing on low-likelihood problems improves Pass@$k$ performance.
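The curation idea can be made concrete with the standard unbiased Pass@$k$ estimator, $1 - \binom{n-c}{k}/\binom{n}{k}$ for $n$ samples with $c$ correct: score each training problem under the base model and keep those it rarely solves. The `threshold` and record fields below are illustrative assumptions, not the paper's exact criterion.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased Pass@k: 1 - C(n-c, k) / C(n, k), for n samples with c correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def select_low_likelihood(problems, n=16, k=8, threshold=0.25):
    # Keep problems the base model rarely solves within k attempts.
    return [p for p in problems if pass_at_k(n, p["num_correct"], k) < threshold]
```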

Authors:Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen
Title: GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
Abstract:
Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at https://github.com/tj12323/GeoPurify.
Summary: GeoPurify leverages latent geometric cues in 2D VLM-generated 3D features through a student-teacher network and geometry-guided pooling, overcoming the trade-off between projection noise and geometric coherence while achieving state-of-the-art performance with minimal training data.

Authors:Jong Bum Won, Wesley De Neve, Joris Vankerschaver, Utku Ozbulak
Title: SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification
Abstract:
Deep neural networks (DNNs) have demonstrated remarkable success in medical imaging, yet their real-world deployment remains challenging due to spurious correlations, where models can learn non-clinical features instead of meaningful medical patterns. Existing medical imaging datasets are not designed to systematically study this issue, largely due to restrictive licensing and limited supplementary patient data. To address this gap, we introduce SpurBreast, a curated breast MRI dataset that intentionally incorporates spurious correlations to evaluate their impact on model performance. Analyzing over 100 features covering patient, device, and imaging protocol characteristics, we identify two dominant spurious signals: magnetic field strength (a global feature influencing the entire image) and image orientation (a local feature affecting spatial alignment). Through controlled dataset splits, we demonstrate that DNNs can exploit these non-clinical signals, achieving high validation accuracy while failing to generalize to unbiased test data. Alongside these two datasets containing spurious correlations, we also provide benchmark datasets without spurious correlations, allowing researchers to systematically investigate clinically relevant and irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies. Models and datasets are available at https://github.com/utkuozbulak/spurbreast.
Summary: Deep neural networks in medical imaging often exploit spurious correlations such as magnetic field strength and image orientation rather than clinical patterns, prompting the creation of SpurBreast, a breast MRI dataset designed to systematically evaluate and mitigate these misleading signals.

Authors:Carlijn Lems, Leslie Tessier, John-Melle Bokhorst, Mart van Rijthoven, Witali Aswolinskiy, Matteo Pozzi, Natalie Klubickova, Suzanne Dintzis, Michela Campora, Maschenka Balkenhol, Peter Bult, Joey Spronck, Thomas Detone, Mattia Barbareschi, Enrico Munari, Giuseppe Bogina, Jelle Wesseling, Esther H. Lips, Francesco Ciompi, Frédérique Meeuwsen, Jeroen van der Laak
Title: A Multicentric Dataset for Training and Benchmarking Breast Cancer Segmentation in H&E Slides
Abstract:
Automated semantic segmentation of whole-slide images (WSIs) stained with hematoxylin and eosin (H&E) is essential for large-scale artificial intelligence-based biomarker analysis in breast cancer. However, existing public datasets for breast cancer segmentation lack the morphological diversity needed to support model generalizability and robust biomarker validation across heterogeneous patient cohorts. We introduce BrEast cancEr hisTopathoLogy sEgmentation (BEETLE), a dataset for multiclass semantic segmentation of H&E-stained breast cancer WSIs. It consists of 587 biopsies and resections from three collaborating clinical centers and two public datasets, digitized using seven scanners, and covers all molecular subtypes and histological grades. Using diverse annotation strategies, we collected annotations across four classes - invasive epithelium, non-invasive epithelium, necrosis, and other - with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells. The dataset's diversity and relevance to the rapidly growing field of automated biomarker quantification in breast cancer ensure its high potential for reuse. Finally, we provide a well-curated, multicentric external evaluation set to enable standardized benchmarking of breast cancer segmentation models.
Summary: The BEETLE dataset addresses the lack of morphological diversity in existing breast cancer segmentation datasets by providing 587 annotated whole-slide images covering all molecular subtypes and histological grades, specifically targeting underrepresented morphologies to enhance model generalizability and biomarker validation.

Authors:Mengtian Li, Yunshu Bai, Yimin Chu, Yijun Shen, Zhongmei Li, Weifeng Ge, Zhifeng Xie, Chaofeng Chen
Title: GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing
Abstract:
We introduce GaussianMorphing, a novel framework for semantic-aware 3D shape and texture morphing from multi-view images. Previous approaches usually rely on point clouds or require pre-defined homeomorphic mappings for untextured data. Our method overcomes these limitations by leveraging mesh-guided 3D Gaussian Splatting (3DGS) for high-fidelity geometry and appearance modeling. The core of our framework is a unified deformation strategy that anchors 3D Gaussians to reconstructed mesh patches, ensuring geometrically consistent transformations while preserving texture fidelity through topology-aware constraints. In parallel, our framework establishes unsupervised semantic correspondence by using the mesh topology as a geometric prior and maintains structural integrity via physically plausible point trajectories. This integrated approach preserves both local detail and global semantic coherence throughout the morphing process without requiring labeled data. On our proposed TexMorph benchmark, GaussianMorphing substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Project page: https://baiyunshu.github.io/GAUSSIANMORPHING.github.io/

Authors:Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
Title: $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models
Abstract:
The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
Summary: The Granular-GRPO framework enhances alignment with human preferences in flow models by employing singular stochastic sampling for step-wise exploration and multi-granularity advantage integration for robust reward evaluation, outperforming existing flow-based GRPO baselines.

Authors:Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
Title: Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Abstract:
Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with the LLM's output textual tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
Summary: The PaDT framework introduces a unified approach for multimodal large language models to directly generate both textual and diverse visual outputs through visual reference tokens, achieving state-of-the-art performance across multiple vision tasks.
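Dynamically expanding the embedding table for Visual Reference Tokens amounts to appending rows derived from patch embeddings. A minimal PyTorch sketch; the function name and shapes are assumptions, not PaDT's actual code:

```python
import torch
import torch.nn as nn

def expand_embedding_table(embed: nn.Embedding, vrt_vectors: torch.Tensor) -> nn.Embedding:
    # Append per-image Visual Reference Token embeddings as new rows.
    old = embed.weight.data
    grown = nn.Embedding(old.shape[0] + vrt_vectors.shape[0], old.shape[1])
    grown.weight.data = torch.cat([old, vrt_vectors.to(old)], dim=0)
    return grown  # new VRT ids start at the old vocabulary size
```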

Authors:Guangyao Zhai, Yue Zhou, Xinyan Deng, Lars Heckler, Nassir Navab, Benjamin Busam
Title: Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors
Abstract:
Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the amount of anomaly in an image correlates directly with differences in the learned embeddings, and we utilize this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for anomaly detection, characterizing and identifying out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.
Summary: FoundAD introduces a few-shot anomaly detection method that leverages foundation visual encoders to distinguish anomalies by projecting images onto the natural image manifold, achieving competitive performance with fewer parameters.
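The projection-operator idea can be sketched as a small network trained to map features back onto the normal-image manifold, with the residual norm serving as the anomaly score. The dimensions and architecture below are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    # Nonlinear projection onto the normal-image feature manifold (sketch).
    def __init__(self, dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, f):
        return self.net(f)

def anomaly_scores(patch_feats, projector):
    # Residual norm per patch: large deviation => likely out-of-distribution.
    return (patch_feats - projector(patch_feats)).norm(dim=-1)
```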

Authors:Yi Ai, Yuanhao Cai, Yulun Zhang, Xiaokang Yang
Title: Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction
Abstract:
Hyperspectral imaging (HSI) provides rich spatial-spectral information but remains costly to acquire due to hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements. Although compressive sensing systems such as CASSI improve efficiency, accurate reconstruction is still challenged by severe degradation and loss of fine spectral details. We propose the Flow-Matching-guided Unfolding network (FMU), which, to our knowledge, is the first to integrate flow matching into HSI reconstruction by embedding its generative prior within a deep unfolding framework. To further strengthen the learned dynamics, we introduce a mean velocity loss that enforces global consistency of the flow, leading to a more robust and accurate reconstruction. This hybrid design leverages the interpretability of optimization-based methods and the generative capacity of flow matching. Extensive experiments on both simulated and real datasets show that FMU significantly outperforms existing approaches in reconstruction quality. Code and models will be available at https://github.com/YiAi03/FMU.
Summary: The Flow-Matching-guided Unfolding network (FMU) integrates flow matching into hyperspectral image reconstruction within a deep unfolding framework, enhanced by a mean velocity loss for global consistency, and outperforms existing methods on both simulated and real datasets.

Authors:Pierre Musacchio, Hyunmin Lee, Jaesik Park
Title: Holistic Order Prediction in Natural Scenes
Abstract:
Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, modern methods rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic number of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.
Summary: InstaFormer is a network that predicts complete occlusion and depth orderings for all instances in a scene from a single RGB image in one forward pass, overcoming the expensive input formats and quadratic inference costs of existing methods.

Authors:Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel
Title: FreeViS: Training-free Video Stylization with Inconsistent References
Abstract:
Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

Authors:Ke Jia, Ji Zhou, Hanxin Li, Zhigan Zhou, Haojie Chu, Xiaojie Li
Title: An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution
Abstract:
In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target's position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show our 3.07M-parameter model achieves high precision and 14 ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it highly suitable for deployment in real-time industrial applications. The code is available at: https://github.com/ZhouJ6610/PoseMatch-TDCM.
Summary: This paper introduces a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, achieving high precision and fast inference under compound transformations for real-time industrial applications.

Authors:Jin Cao, Hongrui Wu, Ziyong Feng, Hujun Bao, Xiaowei Zhou, Sida Peng
Title: UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Abstract:
This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/
Summary: UniVerse is a unified framework that decouples robust 3D reconstruction into restoration and reconstruction subtasks, using a video diffusion model to convert inconsistent multi-view images into consistent ones and achieving superior performance with strong generalization.

Authors:Xiaoyang Liu, Zhengyan Zhou, Zihang Xu, Jiezhang Cao, Zheng Chen, Yulun Zhang
Title: FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring
Abstract:
Recent advancements in image motion deblurring, driven by CNNs and transformers, have made significant progress. Large-scale pre-trained diffusion models, which encode rich real-world priors, have shown great promise for high-quality image restoration tasks such as deblurring, demonstrating stronger generative capabilities than CNN- and transformer-based methods. However, challenges such as long inference times and compromised fidelity still limit the full potential of diffusion models. To address this, we introduce FideDiff, a novel single-step diffusion model designed for high-fidelity deblurring. We reformulate motion deblurring as a diffusion-like process where each timestep represents a progressively blurred image, and we train a consistency model that aligns all timesteps to the same clean image. By reconstructing training data with matched blur trajectories, the model learns temporal consistency, enabling accurate one-step deblurring. We further enhance model performance by integrating Kernel ControlNet for blur kernel estimation and introducing adaptive timestep prediction. Our model achieves superior performance on full-reference metrics, surpassing previous diffusion-based methods and matching the performance of other state-of-the-art models. FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks, establishing a robust baseline for further advancing diffusion models in real-world industrial applications. Our dataset and code will be available at https://github.com/xyLiu339/FideDiff.
Summary: FideDiff addresses the long inference times and compromised fidelity of diffusion-based deblurring with a single-step model that enforces temporal consistency, matching state-of-the-art performance on full-reference metrics.

Authors:Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, Zheng Zhu
Title: VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
Summary: VLA-R1 enhances Vision-Language-Action models by integrating reinforcement learning from verifiable rewards with a high-quality chain-of-thought dataset to improve reasoning robustness and execution accuracy, demonstrating superior generalization across diverse platforms.

Authors:Changmin Lee, Jihyun Lee, Tae-Kyun Kim
Title: MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics
Abstract:
While there has been significant progress in the field of 3D avatar creation from visual observations, modeling physically plausible dynamics of humans with loose garments remains a challenging problem. Although a few existing works address this problem by leveraging physical simulation, they suffer from limited accuracy or robustness to novel animation inputs. In this work, we present MPMAvatar, a framework for creating 3D human avatars from multi-view videos that supports highly realistic, robust animation, as well as photorealistic rendering from free viewpoints. For accurate and robust dynamics modeling, our key idea is to use a Material Point Method-based simulator, which we carefully tailor to model garments with complex deformations and contact with the underlying body by incorporating an anisotropic constitutive model and a novel collision handling algorithm. We combine this dynamics modeling scheme with our canonical avatar that can be rendered using 3D Gaussian Splatting with quasi-shadowing, enabling high-fidelity rendering for physically realistic animations. In our experiments, we demonstrate that MPMAvatar significantly outperforms the existing state-of-the-art physics-based avatar in terms of (1) dynamics modeling accuracy, (2) rendering accuracy, and (3) robustness and efficiency. Additionally, we present a novel application in which our avatar generalizes to unseen interactions in a zero-shot manner, which was not achievable with previous learning-based methods due to their limited simulation generalizability. Our project page is at: https://KAISTChangmin.github.io/MPMAvatar/

Authors:Ricardo Gonzalez Penuela, Felipe Arias-Russi, Victor Capriles
Title: Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations
Abstract:
Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must wade through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV user questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions.
Summary: The system uses historical questions from Blind and Low Vision users to guide multimodal large language models toward more contextually relevant image descriptions, which anticipated users' questions in 76.1% of cases and were preferred over context-free descriptions.

Authors:Renrong Shao, Wei Zhang, Kangyang Luo, Qin Li, and Jun Wang
Title: Consistent Assistant Domains Transformer for Source-free Domain Adaptation
Abstract:
Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, because source domain data are inaccessible, deterministic invariant features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, and subsequently align the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which addresses these issues by constructing domain-consistent invariant feature representations. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from intermediate aggregated global attentions, addressing the limitation of existing methods in adequately representing diversity. Based on the assistant and target domains, invariant feature representations are obtained via multiple consistency strategies and used to distinguish easy from hard samples. Finally, to align hard samples with their corresponding easy samples, we construct a conditional multi-kernel maximum mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments on benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126 demonstrate significant performance improvements from our approach. Code is available at https://github.com/RoryShao/CADTrans.git.
Summary: CADTrans is a source-free domain adaptation method that constructs invariant feature representations through assistant domains and multiple consistency strategies, aligning hard samples to easy ones with a conditional multi-kernel maximum mean discrepancy and substantially improving benchmark performance.
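The core of a multi-kernel MMD is a mixture of RBF kernels; CMK-MMD additionally conditions the comparison on (pseudo-)class, e.g., applying the statistic between hard and easy samples of the same category. A generic sketch of the unconditioned core, with the bandwidths as illustrative assumptions:

```python
import torch

def mk_mmd2(x, y, sigmas=(1.0, 2.0, 4.0)):
    # Biased multi-kernel MMD^2 estimate with a mixture of Gaussian kernels.
    def gram(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2.0 * s * s)) for s in sigmas) / len(sigmas)
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
```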

Authors:Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen
Title: Growing Visual Generative Capacity for Pre-Trained MLLMs
Abstract:
Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.

Authors:Meilong Xu, Xiaoling Hu, Shahira Abousamra, Chen Li, Chao Chen
Title: MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation
Abstract:
In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.
Summary: This study introduces a semi-supervised segmentation framework that enforces topological consistency across multiple perturbed predictions to distinguish meaningful structures from artifacts in histopathology images, achieving improved accuracy through a novel feature matching strategy.

Authors:Yuxuan Ou, Ning Bi, Jiazhen Pan, Jiancheng Yang, Boliang Yu, Usama Zidan, Regent Lee, Vicente Grau
Title: AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging
Abstract:
While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
Summary: AortaDiff is a unified deep learning framework that simultaneously generates synthetic contrast-enhanced CT images from non-contrast scans and segments aortic structures, outperforming existing methods in both image quality and anatomical accuracy through conditional diffusion models and multi-task learning.
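The semi-supervised multitask objective can be sketched as a synthesis loss plus a segmentation loss that is masked out for scans without annotations. The loss choices and weight below are illustrative assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

def multitask_loss(pred_img, tgt_img, pred_seg, tgt_seg, has_label, w_seg=1.0):
    # Synthesis term on all scans; segmentation term only where labels exist.
    # has_label: (B,) bool mask marking scans with segmentation annotations.
    loss_img = F.l1_loss(pred_img, tgt_img)
    if has_label.any():
        loss_seg = F.cross_entropy(pred_seg[has_label], tgt_seg[has_label])
    else:
        loss_seg = pred_seg.sum() * 0.0   # zero, but keeps the graph intact
    return loss_img + w_seg * loss_seg
```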

Authors:Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman
Title: Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
Abstract:
Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection across different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
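The selection step described, clustering singular-value trajectories from a proxy model and sampling a balanced subset, might look roughly like the following; the cluster count and the even per-cluster budget are assumptions, not XMAS's actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def xmas_select(trajectories, budget, n_clusters=50, seed=0):
    # trajectories: (N, steps, top_k) singular-value curves from the proxy model.
    X = trajectories.reshape(len(trajectories), -1)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    rng, picked = np.random.default_rng(seed), []
    per_cluster = max(1, budget // n_clusters)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idx))
        picked.extend(rng.choice(idx, size=take, replace=False))
    return np.asarray(picked[:budget])
```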

Authors:Shijia Feng, Michael Wray, Walterio Mayol-Cuevas
Title: EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels
Abstract:
The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user's current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities -- tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.
Summary: This work introduces a dataset for identifying struggle during skill acquisition, demonstrating that temporal action localization models can detect and generalize struggle cues across tasks and activities, though room for improvement remains.

Authors:Berker Demirel, Marco Fumero, Theofanis Karaletsos, Francesco Locatello
Title: MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging
Abstract:
Simulating in silico cellular responses to interventions is a promising direction to accelerate high-content image-based assays, critical for advancing drug discovery and gene editing. To support this, we introduce MorphGen, a state-of-the-art diffusion-based generative model for fluorescent microscopy that enables controllable generation across multiple cell types and perturbations. To capture biologically meaningful patterns consistent with known cellular morphologies, MorphGen is trained with an alignment loss to match its representations to the phenotypic embeddings of OpenPhenom, a state-of-the-art biological foundation model. Unlike prior approaches that compress multichannel stains into RGB images -- thus sacrificing organelle-specific detail -- MorphGen generates the complete set of fluorescent channels jointly, preserving per-organelle structures and enabling a fine-grained morphological analysis that is essential for biological interpretation. We demonstrate biological consistency with real images via CellProfiler features, and MorphGen attains an FID score over $35\%$ lower than the prior state-of-the-art MorphoDiff, which only generates RGB images for a single cell type. Code is available at https://github.com/czi-ai/MorphGen.
Summary: MorphGen is a diffusion-based generative model that enables controllable generation of fluorescent microscopy images across multiple cell types and perturbations while preserving organelle-specific detail, achieving an FID score more than 35% lower than the prior state of the art.

Authors:Chetwin Low, Weimin Wang, Calder Katyal
Title: Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Abstract:
Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at https://aaxwaz.github.io/Ovi

Authors:Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu, Jinhui Tang
Title: IMAGEdit: Let Any Subject Transform
Abstract:
In this paper, we present IMAGEdit, a training-free video subject editing framework that handles any number of subjects, manipulating the appearances of multiple designated subjects while preserving non-target regions, without finetuning or retraining. We achieve this by providing robust multimodal conditioning and precise mask sequences through a prompt-guided multimodal alignment module and a prior-based mask retargeting module. We first leverage large models' understanding and generation capabilities to produce multimodal information and mask motion sequences for multiple subjects of various types. Then, the obtained prior mask sequences are fed into a pretrained mask-driven video generation model to synthesize the edited video. With strong generalization capability, IMAGEdit remedies insufficient prompt-side multimodal conditioning and overcomes mask boundary entanglement in videos with any number of subjects, thereby significantly expanding the applicability of video editing. More importantly, IMAGEdit is compatible with any mask-driven video generation model, significantly improving overall performance. Extensive experiments on our newly constructed multi-subject benchmark MSVBench verify that IMAGEdit consistently surpasses state-of-the-art methods. Code, models, and datasets are publicly available at https://github.com/XWH-A/IMAGEdit.
Summary: IMAGEdit is a training-free framework for multi-subject video editing that uses multimodal conditioning and precise mask sequences to modify designated subjects while preserving non-target regions, surpassing state-of-the-art methods on benchmarks without model retraining.

Authors:Jiahao Wang, Luoxin Ye, TaiMing Lu, Junfei Xiao, Jiahan Zhang, Yuxiang Guo, Xijun Liu, Rama Chellappa, Cheng Peng, Alan Yuille, Jieneng Chen
Title: EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory
Abstract:
Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene's 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes future frames by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-art methods that only synthesize videos, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.
Chinese Summary: EvoWorld is a world model that combines panoramic video generation with evolving 3D memory, using reconstructed geometry as explicit spatial guidance to markedly improve visual realism and spatial coherence during long-horizon exploration.
English Summary: EvoWorld is a world model that integrates panoramic video generation with evolving 3D memory to achieve long-horizon spatial consistency by using reconstructed geometry as explicit guidance for enhanced visual realism and coherence.

Authors:Jiye Lee, Chenghui Li, Linh Tran, Shih-En Wei, Jason Saragih, Alexander Richard, Hanbyul Joo, Shaojie Bai
Title: Audio Driven Real-Time Facial Animation for Social Telepresence
Abstract:
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.

Authors:Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou
Title: Code2Video: A Code-centric Paradigm for Educational Video Generation
Abstract:
While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions; these requirements limit their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and particularly, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.
Chinese Summary: Code2Video is a code-based framework that uses three collaborative agents to turn teaching instructions into executable Python code for generating educational videos, achieving notable gains in quality and efficiency over conventional methods.
English Summary: Code2Video is a code-driven framework that uses collaborative agents to generate educational videos through executable Python code, achieving significant improvements in quality and efficiency over traditional methods.
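The Planner-Coder-Critic flow described above can be pictured as a small orchestration loop. The sketch below is a minimal, model-free illustration of that control flow; the callable names (planner, coder, critic, renderer), the "ok" feedback convention, and the fixed number of auto-fix rounds are assumptions made for illustration, not Code2Video's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    topic: str
    code: str = ""
    feedback: str = ""

def make_video(topic: str,
               planner: Callable[[str], List[str]],
               coder: Callable[[str, str], str],
               critic: Callable[[str], str],
               renderer: Callable[[str], str],
               max_fix_rounds: int = 2) -> List[Segment]:
    """Planner -> Coder -> Critic loop over lecture segments (hypothetical interfaces)."""
    segments = [Segment(t) for t in planner(topic)]     # structure the lecture into coherent steps
    for seg in segments:
        seg.code = coder(seg.topic, "")                  # first code draft for this segment
        for _ in range(max_fix_rounds):                  # simplified stand-in for scope-guided auto-fix
            preview = renderer(seg.code)                 # execute the code and render a preview frame
            seg.feedback = critic(preview)               # VLM-style layout critique (stubbed below)
            if seg.feedback == "ok":
                break
            seg.code = coder(seg.topic, seg.feedback)    # regenerate the segment with the feedback
    return segments

if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end without any models or rendering backend.
    plan = lambda t: [f"{t}: intro", f"{t}: worked example"]
    code = lambda topic, fb: f"# animation script for '{topic}' (feedback: {fb or 'none'})"
    crit = lambda frame_path: "ok"
    render = lambda src: "/tmp/preview.png"
    for s in make_video("Binary search", plan, code, crit, render):
        print(s.topic, "->", s.code)
```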

Authors:Zhanpeng Luo, Haoxi Ran, Li Lu
Title: Instant4D: 4D Gaussian Splatting in Minutes
Abstract:
Dynamic view synthesis has seen significant advances, yet reconstructing scenes from uncalibrated, casual video remains challenging due to slow optimization and complex parameter estimation. In this work, we present Instant4D, a monocular reconstruction system that leverages native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors. Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. Our method reconstructs a single video from the Dycheck dataset, or a typical 200-frame video, within 10 minutes. We further apply our model to in-the-wild videos, showcasing its generalizability. Our project website is published at https://instant4d.github.io/.
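As a rough illustration of the grid-pruning step, one can discretize the SLAM point cloud into voxels and keep only a bounded number of points per cell. The voxel size and per-cell cap below are arbitrary placeholders, not the paper's settings.

```python
import numpy as np

def grid_prune(points: np.ndarray, voxel: float = 0.05, max_per_cell: int = 1) -> np.ndarray:
    """Keep at most `max_per_cell` points per voxel; a rough stand-in for the grid-pruning step."""
    cells = np.floor(points / voxel).astype(np.int64)            # voxel index triple per point
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)   # one integer id per occupied voxel
    keep = np.zeros(len(points), dtype=bool)
    counts: dict = {}
    for i, cid in enumerate(cell_id):
        seen = counts.get(cid, 0)
        if seen < max_per_cell:
            keep[i] = True
            counts[cid] = seen + 1
    return points[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(100_000, 3))                          # stand-in for a dense SLAM point cloud
    pruned = grid_prune(pts, voxel=0.1)
    print(f"{len(pts)} -> {len(pruned)} points after grid pruning")
```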

Authors:Siheng Wan, Zhengtao Yao, Zhengdao Li, Junhao Dong, Yanshu Li, Yikai Li, Linshan Li, Haoyan Xu, Yijiang Li, Zhikang Dong, Huacan Wang, Jifeng Shen
Title: JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation
Abstract:
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose \textbf{JEPA-T}, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency, open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is now available at https://github.com/justin-herry/JEPA-T.git
Chinese Summary: JEPA-T is a unified multimodal framework that fuses visual and textual tokens through cross-attention and embedding injection, delivering strong data efficiency and open-vocabulary generalization in text-to-image generation and outperforming non-fusion and late-fusion baselines.
English Summary: JEPA-T is a unified multimodal framework that enhances text-to-image generation by effectively fusing visual and textual tokens through cross-attention and embedding injection, achieving superior data efficiency and open-vocabulary generalization compared to baseline methods.
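A minimal PyTorch sketch of the fusion pattern described in the abstract: visual tokens pass through a task-agnostic predictor, then query the text tokens via cross-attention for conditional denoising. Module sizes, layer counts, and the residual fusion are illustrative assumptions rather than JEPA-T's exact architecture.

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Cross-attention after the predictor: visual tokens query text tokens (illustrative dims)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.predictor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, vis_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        h = self.predictor(vis_tokens)                           # task-agnostic backbone pass
        attn, _ = self.cross_attn(h, text_tokens, text_tokens)   # condition on text via cross-attention
        return self.out(self.norm(h + attn))                     # residual fusion, then projection

if __name__ == "__main__":
    vis = torch.randn(2, 64, 256)    # batch of visual token sequences
    txt = torch.randn(2, 16, 256)    # batch of text token embeddings
    model = TextConditionedDenoiser()
    print(model(vis, txt).shape)     # torch.Size([2, 64, 256])
```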

Authors:Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang
Title: InfVSR: Breaking Length Limits of Generic Video Super-Resolution
Abstract:
Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive one-step diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at https://github.com/Kai-Liu001/InfVSR.
Chinese: InfVSR proposes an autoregressive one-step diffusion paradigm that uses a causal structure and single-step distillation to super-resolve videos of unbounded length efficiently, improving semantic consistency while greatly accelerating processing.
English: InfVSR introduces an autoregressive one-step diffusion paradigm for video super-resolution, enabling efficient and scalable processing of unbounded-length videos while achieving state-of-the-art quality with significant speed improvements.
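The streaming behavior can be approximated by a chunk-by-chunk loop that carries a bounded amount of past context forward, in the spirit of a rolling KV-cache. The toy module below uses a 3D convolution instead of a causal DiT and caches raw chunk features rather than attention keys/values; both are simplifications for illustration only.

```python
from collections import deque
import torch
import torch.nn as nn

class StreamingSR(nn.Module):
    """Chunk-by-chunk processing with a rolling context window (toy stand-in for a causal DiT + KV-cache)."""
    def __init__(self, channels: int = 8, context_chunks: int = 2):
        super().__init__()
        self.mix = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.cache = deque(maxlen=context_chunks)     # rolling cache of past chunk features

    @torch.no_grad()
    def forward_chunk(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk: (B, C, T, H, W). Concatenate cached context along time, then keep only the new frames.
        context = list(self.cache) + [chunk]
        x = torch.cat(context, dim=2)
        y = self.mix(x)[:, :, -chunk.shape[2]:]        # single forward pass, outputs only the new chunk
        self.cache.append(chunk.detach())              # roll the cache forward
        return y

if __name__ == "__main__":
    model = StreamingSR()
    video = torch.randn(1, 8, 40, 16, 16)              # a long sequence, processed 8 frames at a time
    outs = [model.forward_chunk(video[:, :, t:t + 8]) for t in range(0, 40, 8)]
    print(torch.cat(outs, dim=2).shape)                 # torch.Size([1, 8, 40, 16, 16])
```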

Authors:Ali Shadman Yazdi, Annalisa Cappella, Benedetta Baldini, Riccardo Solazzo, Gianluca Tartaglia, Chiarella Sforza, Giuseppe Baselli
Title: PAL-Net: A Point-Wise CNN with Patch-Attention for 3D Facial Landmark Localization
Abstract:
Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserves relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41\,mm and a distance-wise error of 0.38\,mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions. PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at https://github.com/Ali5hadman/PAL-Net-A-Point-Wise-CNN-with-Patch-Attention
Chinese: The proposed PAL-Net automated deep learning pipeline accurately localizes anatomical landmarks on 3D facial scans with expert-level precision, providing an efficient computational solution for clinical and research applications.
English: This study introduces PAL-Net, an automated deep learning pipeline that accurately localizes anatomical landmarks on 3D facial scans with performance comparable to human experts, offering a computationally efficient solution for clinical and research applications.

Authors:Hyun-kyu Ko, Youbin Kim, Jihyeon Park, Dongheok Park, Gyeongjin Kang, Wonjun Cho, Hyung Yi, Eunbyung Park
Title: Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model
Abstract:
State Space Models (SSMs), most notably RNNs, have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: https://github.com/Ko-Lani/GSMamba.
Chinese Summary: This paper proposes a hybrid architecture that combines shifted-window self-attention for spatial context modeling with Mamba-based selective scanning for efficient temporal propagation, and uses a Gather-Scatter Mamba mechanism to reduce occlusion artifacts in video super-resolution.
English Summary: This paper introduces a hybrid architecture combining shifted window self-attention for spatial context with Mamba-based selective scanning for efficient temporal propagation, along with a Gather-Scatter Mamba mechanism to reduce occlusion artifacts in video super-resolution.
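A small sketch of the gather/scatter idea: warp every frame's features onto a center anchor frame with per-frame flow, run a temporal model on the aligned stack, then warp the result back to each frame. The flow fields are assumed to be given, and the placeholder temporal model stands in for the Mamba-based propagation.

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp features (B, C, H, W) with a flow field (B, 2, H, W) in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat)              # (2, H, W), x first
    coords = grid.unsqueeze(0) + flow                                  # sampling locations
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    norm = torch.stack((coords_x, coords_y), dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(feat, norm, align_corners=True)

def gather_scatter(frame_feats, flows_to_anchor, flows_from_anchor, temporal_model):
    """Gather all frames onto the anchor, propagate, then scatter the result back per frame."""
    gathered = [warp(f, fl) for f, fl in zip(frame_feats, flows_to_anchor)]    # align to anchor
    fused = temporal_model(torch.stack(gathered, dim=1))                       # (B, T, C, H, W) in/out
    return [warp(fused[:, t], fl) for t, fl in enumerate(flows_from_anchor)]   # redistribute

if __name__ == "__main__":
    T = 3
    feats = [torch.randn(1, 4, 8, 8) for _ in range(T)]
    zero_flow = [torch.zeros(1, 2, 8, 8) for _ in range(T)]
    # Placeholder temporal model: average across time, broadcast back to every frame.
    out = gather_scatter(feats, zero_flow, zero_flow,
                         temporal_model=lambda x: x.mean(1, keepdim=True).expand_as(x))
    print(len(out), out[0].shape)
```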

Authors:Xiangtao Kong, Rongyuan Wu, Shuaizheng Liu, Lingchen Sun, Lei Zhang
Title: NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution
Abstract:
Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and compromises robustness to inputs with varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by an end-to-end full-model fine-tuning. Such a comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. Project page: https://github.com/Xiangtaokong/NSARM
Chinese: Recent diffusion-based real-world image super-resolution methods struggle to balance quality and efficiency, whereas the proposed Next-Scale Autoregressive Modeling (NSARM) framework uses a two-stage training strategy to markedly improve robustness to inputs with varying degradations while maintaining fast inference, yielding better visual reconstruction.
English: Recent real-world image super-resolution methods using diffusion models face trade-offs between quality and efficiency, but the proposed Next-Scale Autoregressive Modeling (NSARM) framework overcomes these by employing a two-stage training approach that enhances robustness and maintains fast inference while achieving superior visual results.

Authors:Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
Title: ZQBA: Zero Query Black-box Adversarial Attack
Abstract:
Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.
Chinese: The proposed Zero Query Black-box Adversarial (ZQBA) attack uses the feature maps of deep neural networks to generate imperceptible adversarial samples that fool target models across datasets without any queries, outperforming existing methods.
English: The proposed Zero Query Black-box Adversarial (ZQBA) attack uses feature maps from a Deep Neural Network to create imperceptible adversarial samples that effectively fool target models across datasets without requiring queries, outperforming existing methods.
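The core recipe, reading the abstract, is to reuse a source network's feature maps as an additive perturbation on clean images, with no queries to the target model. The sketch below picks an intermediate ResNet layer, averages its channels, and bounds the perturbation by an eps budget; the layer choice, normalization, and budget are assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def feature_map_perturbation(image: torch.Tensor, eps: float = 8 / 255) -> torch.Tensor:
    """Build an additive perturbation from a source model's feature maps (no queries to the target)."""
    src = models.resnet18(weights=None).eval()            # any source DNN; pretrained weights omitted here
    feats = {}
    src.layer2.register_forward_hook(lambda m, i, o: feats.update(out=o))
    with torch.no_grad():
        src(image)
    fmap = feats["out"].mean(dim=1, keepdim=True)         # average channels into one saliency-like map
    fmap = F.interpolate(fmap, size=image.shape[-2:], mode="bilinear", align_corners=False)
    fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)   # normalize to [0, 1]
    perturbed = image + eps * (2 * fmap - 1)              # signed, eps-bounded additive perturbation
    return perturbed.clamp(0, 1)

if __name__ == "__main__":
    x = torch.rand(1, 3, 224, 224)                         # stand-in for a clean input image
    x_adv = feature_map_perturbation(x)
    print((x_adv - x).abs().max().item())                  # stays within the eps budget
```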

Authors:Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin
Title: HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
Abstract:
Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.
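One way to picture the moment-token memory is a lightweight attention-pooling module that compresses past per-timestep tokens into a few memory features, which are then concatenated with the current observation tokens for action prediction. Dimensions, slot counts, and the pooling design below are illustrative guesses, not HAMLET's actual modules.

```python
import torch
import torch.nn as nn

class MomentMemory(nn.Module):
    """Compress one moment token per past step into memory features via attention pooling (illustrative)."""
    def __init__(self, dim: int = 128, mem_slots: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(mem_slots, dim) * 0.02)   # learned memory slots
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, moment_tokens: torch.Tensor) -> torch.Tensor:
        # moment_tokens: (B, T_past, D) -> memory features: (B, mem_slots, D)
        q = self.queries.unsqueeze(0).expand(moment_tokens.size(0), -1, -1)
        mem, _ = self.attn(q, moment_tokens, moment_tokens)
        return mem

if __name__ == "__main__":
    history = torch.randn(2, 16, 128)            # 16 past moment tokens per sample
    current = torch.randn(2, 32, 128)            # current observation tokens
    memory = MomentMemory()(history)
    policy_input = torch.cat([memory, current], dim=1)   # memory features prepended for action prediction
    print(policy_input.shape)                    # torch.Size([2, 36, 128])
```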

Authors:Steffen Meinert, Philipp Schlinge, Nils Strodthoff, Martin Atzmueller
Title: ProtoMask: Segmentation-Guided Prototype Learning
Abstract:
XAI has gained considerable importance in recent years. Methods based on prototypical case-based reasoning have shown a promising improvement in explainability. However, these methods typically rely on additional post-hoc saliency techniques to explain the semantics of learned prototypes. Multiple critiques have been raised about the reliability and quality of such techniques. For this reason, we study the use of prominent image segmentation foundation models to improve the truthfulness of the mapping between embedding and input space. We aim to restrict the computation area of the saliency map to a predefined semantic image patch to reduce the uncertainty of such visualizations. To perceive the information of an entire image, we use the bounding box from each generated segmentation mask to crop the image. Each mask results in an individual input in our novel model architecture named ProtoMask. We conduct experiments on three popular fine-grained classification datasets with a wide set of metrics, providing a detailed overview on explainability characteristics. The comparison with other popular models demonstrates competitive performance and unique explainability features of our model. Code is available at https://github.com/uos-sis/quanproto
Chinese: This study proposes ProtoMask, which uses image segmentation to improve prototype mapping and reduce visualization uncertainty, thereby enhancing AI explainability and showing competitive performance on fine-grained datasets.
English: The study introduces ProtoMask, a novel model that enhances explainability in AI by using image segmentation to refine prototype mappings and reduce visualization uncertainty, demonstrating competitive performance on fine-grained datasets.

Authors:Francesco Galati, Daniele Falcetta, Rosa Cortese, Ferran Prados, Ninon Burgos, Maria A. Zuluaga
Title: Multi-Domain Brain Vessel Segmentation Through Feature Disentanglement
Abstract:
The intricate morphology of brain vessels poses significant challenges for automatic segmentation models, which usually focus on a single imaging modality. However, accurately treating brain-related conditions requires a comprehensive understanding of the cerebrovascular tree, regardless of the specific acquisition procedure. Our framework effectively segments brain arteries and veins in various datasets through image-to-image translation while avoiding domain-specific model design and data harmonization between the source and the target domain. This is accomplished by employing disentanglement techniques to independently manipulate different image properties, allowing them to move from one domain to another in a label-preserving manner. Specifically, we focus on manipulating vessel appearances during adaptation while preserving spatial information, such as shapes and locations, which are crucial for correct segmentation. Our evaluation effectively bridges large and varied domain gaps across medical centers, image modalities, and vessel types. Additionally, we conduct ablation studies on the optimal number of required annotations and other architectural choices. The results highlight our framework's robustness and versatility, demonstrating the potential of domain adaptation methodologies to perform cerebrovascular image segmentation in multiple scenarios accurately. Our code is available at https://github.com/i-vesseg/MultiVesSeg.
Chinese: This framework uses image-to-image translation and disentanglement to adapt vessel appearance across domains while preserving spatial detail, effectively overcoming the challenges of brain vessel segmentation and achieving robust performance on diverse medical datasets.
English: This framework overcomes brain vessel segmentation challenges by using image-to-image translation and disentanglement techniques to adapt vessel appearances across domains while preserving spatial details, achieving robust performance across diverse medical datasets.

Authors:Beomsu Kim, Byunghee Cha, Jong Chul Ye
Title: Align Your Tangent: Training Better Consistency Models via Manifold-Aligned Tangents
Abstract:
With diffusion and flow matching models achieving state-of-the-art generation performance, the community's interest has now turned to reducing the inference time without sacrificing sample quality. Consistency Models (CMs), which are trained to be consistent on diffusion or probability flow ordinary differential equation (PF-ODE) trajectories, enable one- or two-step flow or diffusion sampling. However, CMs typically require prolonged training with large batch sizes to obtain competitive sample quality. In this paper, we examine the training dynamics of CMs near convergence and discover that CM tangents (CM output update directions) are quite oscillatory, in the sense that they move parallel to the data manifold, not towards the manifold. To mitigate oscillatory tangents, we propose a new loss function, called the manifold feature distance (MFD), which provides manifold-aligned tangents that point toward the data manifold. Consequently, our method, dubbed Align Your Tangent (AYT), can accelerate CM training by orders of magnitude and even outperform the learned perceptual image patch similarity metric (LPIPS). Furthermore, we find that our loss enables training with extremely small batch sizes without compromising sample quality. Code: https://github.com/1202kbs/AYT
Chinese: This paper proposes Align Your Tangent (AYT), which introduces a manifold feature distance loss to correct oscillatory training directions in Consistency Models, greatly accelerating training and preserving sample quality even with small batch sizes.
English: The paper introduces Align Your Tangent (AYT), a method that uses a manifold feature distance loss to correct oscillatory training directions in Consistency Models, significantly accelerating training and maintaining sample quality even with small batch sizes.
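The manifold feature distance can be read as "compare outputs in the feature space of a frozen encoder rather than in pixel space." The sketch below implements that pattern with a tiny placeholder encoder; the actual MFD presumably relies on a specific pretrained feature extractor, so treat this purely as a shape-correct stand-in.

```python
import torch
import torch.nn as nn

class ManifoldFeatureDistance(nn.Module):
    """Distance measured in a frozen encoder's feature space (stand-in for the paper's MFD loss)."""
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)                     # the feature extractor stays fixed

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (self.encoder(pred) - self.encoder(target)).pow(2).mean()

if __name__ == "__main__":
    # Tiny conv encoder as a placeholder for a pretrained feature extractor.
    enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
                        nn.Conv2d(16, 32, 3, stride=2, padding=1))
    mfd = ManifoldFeatureDistance(enc)
    x_pred = torch.randn(4, 3, 32, 32, requires_grad=True)   # consistency-model output
    x_tgt = torch.randn(4, 3, 32, 32)                         # target along the PF-ODE trajectory
    loss = mfd(x_pred, x_tgt)
    loss.backward()                                           # gradients flow back to the model output
    print(loss.item())
```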

Authors:Guozhen Zhang, Haiguang Wang, Chunyu Wang, Yuan Zhou, Qinglin Lu, Limin Wang
Title: Arbitrary Generative Video Interpolation
Abstract:
Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: https://mcg-nju.github.io/ArbInterp-Web/.
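Timestamp-aware RoPE amounts to driving the rotary angles with continuous, normalized timestamps instead of integer frame indices. In the sketch below, the frequency schedule and the scale that maps [0, 1] timestamps onto a nominal position range are assumptions; only the "rotate by timestamp" idea is taken from the abstract.

```python
import torch

def timestamp_rope(x: torch.Tensor, timestamps: torch.Tensor,
                   scale: float = 64.0, base: float = 10000.0) -> torch.Tensor:
    """Rotary embedding driven by continuous timestamps.

    x: (B, T, D) token features; timestamps: (T,) normalized to [0, 1].
    `scale` maps normalized time onto a nominal position range (an assumption of this sketch).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))   # (half,)
    angles = (timestamps * scale)[:, None] * freqs[None, :]                    # (T, half)
    cos, sin = angles.cos()[None], angles.sin()[None]                          # broadcast over batch
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

if __name__ == "__main__":
    tokens = torch.randn(1, 5, 8)
    # Frames are aligned to arbitrary target timestamps rather than fixed integer slots.
    ts = torch.tensor([0.0, 0.125, 0.5, 0.8, 1.0])
    print(timestamp_rope(tokens, ts).shape)   # torch.Size([1, 5, 8])
```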

Authors:Kaiqi Zhang, Mingguan Yang, Dali Chang, Chun Chen, Yuxiang Zhang, Kexun He, Jing Zhao
Title: Relative-Absolute Fusion: Rethinking Feature Extraction in Image-Based Iterative Method Selection for Solving Sparse Linear Systems
Abstract:
Iterative method selection is crucial for solving sparse linear systems because these methods inherently lack robustness. Though image-based selection approaches have shown promise, their feature extraction techniques might encode distinct matrices into identical image representations, leading to the same selection and suboptimal method. In this paper, we introduce RAF (Relative-Absolute Fusion), an efficient feature extraction technique to enhance image-based selection approaches. By simultaneously extracting and fusing image representations as relative features with corresponding numerical values as absolute features, RAF achieves comprehensive matrix representations that prevent feature ambiguity across distinct matrices, thus improving selection accuracy and unlocking the potential of image-based selection approaches. We conducted comprehensive evaluations of RAF on SuiteSparse and our developed BMCMat (Balanced Multi-Classification Matrix dataset), demonstrating solution time reductions of 0.08s-0.29s for sparse linear systems, which is 5.86%-11.50% faster than conventional image-based selection approaches and achieves state-of-the-art (SOTA) performance. BMCMat is available at https://github.com/zkqq/BMCMat.
Chinese: This paper proposes the RAF feature extraction technique, which fuses relative image features with absolute numerical features to avoid ambiguous matrix representations, improving the accuracy of iterative method selection for sparse linear systems and delivering notable speedups with state-of-the-art performance.
English: The paper introduces RAF, a feature extraction technique that fuses relative image features with absolute numerical values to prevent ambiguity in matrix representations, thereby improving the accuracy of iterative method selection for sparse linear systems and achieving state-of-the-art performance with significant speed improvements.

Authors:Yuexin Wang, Xiaolei Wang, Yizheng Gong, Jimin Xiao
Title: Normal-Abnormal Guided Generalist Anomaly Detection
Abstract:
Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.
Chinese Summary: The proposed Normal-Abnormal Generalist Learning (NAGL) framework uses both normal and abnormal samples as references, substantially outperforming existing methods that rely only on normal samples in cross-domain anomaly detection.
English Summary: This paper introduces the Normal-Abnormal Generalist Learning (NAGL) framework, which leverages both normal and abnormal samples to significantly improve cross-domain anomaly detection performance over previous methods that used only normal references.

Authors:Yuheng Ji, Huajie Tan, Cheng Chi, Yijie Xu, Yuting Zhao, Enshen Zhou, Huaihai Lyu, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng
Title: MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles
Abstract:
We introduce \textsc{MathSticks}, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision--language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90\% accuracy. These findings establish \textsc{MathSticks} as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.
Chinese: The MathSticks benchmark evaluates visual symbolic compositional reasoning by requiring matchstick equations to be corrected under conservation rules; results show clear deficiencies in current vision-language models, while humans exceed 90% accuracy.
English: The MathSticks benchmark evaluates visual symbolic compositional reasoning by requiring correction of matchstick equations under conservation rules, revealing significant limitations in current vision-language models compared to human performance.

Authors:Atif Belal, Heitor R. Medeiros, Marco Pedersoli, Eric Granger
Title: VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors
Abstract:
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts -- including stylized domains, driving scenes, low-light conditions, and common corruptions -- shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code : https://github.com/imatif17/VLOD-TTA
Chinese Summary: VLOD-TTA is a test-time adaptation framework that uses IoU-weighted entropy optimization and image-conditioned prompt fusion to improve vision-language object detectors under domain shift, showing consistent gains across a variety of challenging scenarios.
English Summary: VLOD-TTA is a test-time adaptation framework that enhances vision-language object detectors' robustness to domain shifts through IoU-weighted entropy optimization and image-conditioned prompt fusion, demonstrating consistent improvements across diverse challenging conditions.
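The IoU-weighted entropy objective can be sketched as follows: compute each proposal's predictive entropy, weight it by how strongly that proposal overlaps with the others (so isolated boxes contribute little), and minimize the weighted sum at test time. The exact weighting and normalization below are assumptions; only the overall shape of the objective follows the abstract.

```python
import torch
from torchvision.ops import box_iou

def iou_weighted_entropy(logits: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """Entropy of each proposal weighted by its overlap with the other proposals.

    logits: (N, C) class logits per proposal; boxes: (N, 4) in xyxy format.
    """
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # (N,)
    iou = box_iou(boxes, boxes)
    iou.fill_diagonal_(0.0)                                         # ignore self-overlap
    weights = iou.sum(dim=-1)                                        # dense clusters get larger weight
    weights = weights / weights.sum().clamp_min(1e-12)
    return (weights * entropy).sum()

if __name__ == "__main__":
    logits = torch.randn(5, 80, requires_grad=True)
    boxes = torch.tensor([[0, 0, 50, 50], [5, 5, 55, 55], [8, 2, 48, 52],
                          [300, 300, 340, 340], [600, 10, 640, 60]], dtype=torch.float32)
    loss = iou_weighted_entropy(logits, boxes)
    loss.backward()     # an adaptation step would update normalization/prompt parameters (not shown)
    print(loss.item())
```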

Authors:Junhyeok Lee, Han Jang, Kyu Sung Choi
Title: Domain-Specialized Interactive Segmentation Framework for Meningioma Radiotherapy Planning
Abstract:
Precise delineation of meningiomas is crucial for effective radiotherapy (RT) planning, directly influencing treatment efficacy and preservation of adjacent healthy tissues. While automated deep learning approaches have demonstrated considerable potential, achieving consistently accurate clinical segmentation remains challenging due to tumor heterogeneity. Interactive Medical Image Segmentation (IMIS) addresses this challenge by integrating advanced AI techniques with clinical input. However, generic segmentation tools, despite widespread applicability, often lack the specificity required for clinically critical and disease-specific tasks like meningioma RT planning. To overcome these limitations, we introduce Interactive-MEN-RT, a dedicated IMIS tool specifically developed for clinician-assisted 3D meningioma segmentation in RT workflows. The system incorporates multiple clinically relevant interaction methods, including point annotations, bounding boxes, lasso tools, and scribbles, enhancing usability and clinical precision. In our evaluation involving 500 contrast-enhanced T1-weighted MRI scans from the BraTS 2025 Meningioma RT Segmentation Challenge, Interactive-MEN-RT demonstrated substantial improvement compared to other segmentation methods, achieving Dice similarity coefficients of up to 77.6\% and Intersection over Union scores of 64.8\%. These results emphasize the need for clinically tailored segmentation solutions in critical applications such as meningioma RT planning. The code is publicly available at: https://github.com/snuh-rad-aicon/Interactive-MEN-RT
Chinese: Interactive-MEN-RT is an interactive segmentation tool designed specifically for meningioma radiotherapy planning; by combining clinician interaction with AI, it markedly improves segmentation accuracy and achieves the best performance metrics in evaluation.
English: Interactive-MEN-RT is a specialized interactive tool that enhances meningioma segmentation accuracy in radiotherapy planning by integrating clinician input with AI, achieving superior performance metrics in clinical evaluations.

Authors:Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su
Title: VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators
Abstract:
Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.
Chinese: VLA-RFT proposes a reinforcement fine-tuning framework that uses a data-driven world model as a controllable simulator, substantially improving the decision-making efficiency and robustness of Vision-Language-Action models with very few fine-tuning steps.
English: VLA-RFT introduces a reinforcement fine-tuning framework using a data-driven world model as a controllable simulator, which significantly improves decision-making efficiency and robustness in Vision-Language-Action models with minimal fine-tuning steps.

Authors:Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang
Title: OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. In addition, we also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.
Chinese: This paper introduces the OIG-Bench benchmark to evaluate multimodal models' understanding of One-Image Guides; although Qwen2.5-VL-72B performs best with 77% accuracy, all models show clear deficits in semantic understanding and logical reasoning over complex visual-text relationships.
English: This paper introduces OIG-Bench, a benchmark for evaluating multimodal models' understanding of One-Image Guides, revealing that while Qwen2.5-VL-72B performs best with 77% accuracy, all models struggle with semantic and logical reasoning in complex visual-text relationships.

Authors:Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng
Title: HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
Abstract:
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use "zoom in" strategies for better detail, our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is provided at https://github.com/Tennine2077/HiDe.
Chinese: The study finds that complex background interference, rather than object size, is the main limitation of multimodal large language models on high-resolution images, and proposes the training-free HiDe framework, which achieves state-of-the-art performance through token and layout decoupling while substantially reducing memory usage.
English: The study identifies complex background interference, not object size, as the primary limitation in MLLMs' performance on high-resolution images and introduces HiDe, a training-free framework that uses token and layout decoupling to achieve state-of-the-art results with reduced memory usage.

Authors:Youquan Fu, Ruiyang Si, Hongfa Wang, Dongzhan Zhou, Jiacheng Sun, Ping Luo, Di Hu, Hongyuan Zhang, Xuelong Li
Title: Object-AVEdit: An Object-level Audio-Visual Editing Model
Abstract:
There is a high demand for audio-visual editing in video post-production and filmmaking. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to perform object addition, replacement, and removal across both audio and visual modalities, while preserving the structural information of the source instances during the editing process. In this paper, we present \textbf{Object-AVEdit}, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop a word-to-sounding-object well-aligned audio generation model, bridging the gap in object-controllability between audio and current video generation models. Meanwhile, to achieve better structural information preservation and object-level editing effects, we propose a holistically optimized inversion-regeneration editing algorithm, ensuring both information retention during the inversion and a better regeneration effect. Extensive experiments demonstrate that our editing model achieves strong results on both audio and video object-level editing tasks with fine audio-visual semantic alignment. In addition, our audio generation model also achieves strong performance. More results on our project page: https://gewu-lab.github.io/Object_AVEdit-website/.

Authors:Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Title: TTT3R: 3D Reconstruction as Test-Time Training
Abstract:
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available at https://rover-xingyu.github.io/TTT3R
Chinese: TTT3R proposes a training-free method that uses alignment confidence to dynamically adjust memory updates, substantially improving length generalization in 3D reconstruction and doubling pose estimation accuracy while remaining efficient.
English: TTT3R introduces a training-free method that uses alignment confidence to dynamically adjust memory updates, significantly enhancing length generalization in 3D reconstruction and doubling pose estimation accuracy while maintaining efficiency.
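The update rule described in the abstract, a closed-form learning rate derived from alignment confidence, can be pictured as a per-token gate that interpolates between the old memory and the incoming observation. The sigmoid gate, its gain, and its floor below are placeholders; the paper derives its own closed form.

```python
import torch
import torch.nn.functional as F

def update_memory(memory: torch.Tensor, observation: torch.Tensor,
                  gain: float = 4.0, floor: float = 0.05) -> torch.Tensor:
    """Closed-form, per-token memory update gated by alignment confidence.

    memory, observation: (N_tokens, D). The sigmoid gate and its constants are placeholders;
    the actual method derives the rate in closed form from memory/observation alignment.
    """
    conf = F.cosine_similarity(memory, observation, dim=-1)                  # alignment in [-1, 1]
    lr = (torch.sigmoid(gain * conf) * (1.0 - floor) + floor).unsqueeze(-1)  # per-token rate, no optimizer
    return (1.0 - lr) * memory + lr * observation                            # retention vs. adaptation

if __name__ == "__main__":
    mem = torch.randn(1024, 256)
    obs = mem + 0.1 * torch.randn_like(mem)        # a mostly consistent incoming frame
    new_mem = update_memory(mem, obs)
    print(F.cosine_similarity(new_mem, mem, dim=-1).mean().item())
```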

Authors:Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata
Title: Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Abstract:
Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
Chinese: Stitch is a training-free method that uses automatically generated bounding boxes to create and seamlessly integrate objects in modern text-to-image models, improving spatial-relationship accuracy and achieving state-of-the-art results on position-based generation tasks.
English: Stitch is a training-free method that enhances spatial accuracy in modern text-to-image models by using automatically generated bounding boxes to create and seamlessly integrate objects, achieving state-of-the-art performance on position-based tasks.

Authors:Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
Title: Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Abstract:
Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors-the implicit, emergent knowledge about the visual world acquired during language pre-training-are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline-from LLM pre-training to visual alignment and supervised multimodal fine-tuning-across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.

Authors:Xiyi Chen, Shaofei Wang, Marko Mihajlovic, Taewon Kang, Sergey Prokudin, Ming Lin
Title: HART: Human Aligned Reconstruction Transformer
Abstract:
We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, it outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. These human-aligned meshes initialize Gaussian splats to further enable sparse-view rendering. While trained on only 2.3K synthetic scans, HART achieves state-of-the-art results: Chamfer Distance improves by 18-23 percent for clothed-mesh reconstruction, PA-V2V drops by 6-27 percent for SMPL-X estimation, LPIPS decreases by 15-27 percent for novel-view synthesis on a wide range of datasets. These results suggest that feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings. Code and models will be released.

Authors:Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, Chunchao Guo
Title: DA$^2$: Depth Anything in Any Direction
Abstract:
Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data will be released. Project page: https://depth-any-in-any-dir.github.io/.
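One concrete way to expose spherical geometry to a ViT, in the spirit of SphereViT, is to embed each equirectangular position's longitude and latitude with sinusoidal features and add (or attend with) them alongside patch features. The dyadic frequency schedule and the channel split between the two angles below are assumptions of this sketch.

```python
import math
import torch

def spherical_embedding(h: int, w: int, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of per-position (longitude, latitude) on an equirectangular grid.

    Returns (h, w, dim); `dim` must be divisible by 4 (half of the channels for each angle).
    """
    lon = torch.linspace(-math.pi, math.pi, w)            # 360 degrees across the width
    lat = torch.linspace(math.pi / 2, -math.pi / 2, h)    # 180 degrees down the height
    lat_grid, lon_grid = torch.meshgrid(lat, lon, indexing="ij")
    quarter = dim // 4
    freqs = 2.0 ** torch.arange(quarter, dtype=torch.float32)   # dyadic frequencies (an assumption)

    def encode(angle: torch.Tensor) -> torch.Tensor:            # (h, w) -> (h, w, 2 * quarter)
        a = angle.unsqueeze(-1) * freqs
        return torch.cat([a.sin(), a.cos()], dim=-1)

    return torch.cat([encode(lon_grid), encode(lat_grid)], dim=-1)

if __name__ == "__main__":
    emb = spherical_embedding(h=16, w=32, dim=64)          # e.g., one embedding per ViT patch
    print(emb.shape)                                        # torch.Size([16, 32, 64])
    # In a SphereViT-style model this would be added to (or attended with) the patch features.
```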

Authors:Jian Guo Pan, Lin Wang, Xia Cai
Title: Automated and Scalable SEM Image Analysis of Perovskite Solar Cell Materials via a Deep Segmentation Framework
Abstract:
Scanning Electron Microscopy (SEM) is indispensable for characterizing the microstructure of thin films during perovskite solar cell fabrication. Accurate identification and quantification of lead iodide and perovskite phases are critical because residual lead iodide strongly influences crystallization pathways and defect formation, while the morphology of perovskite grains governs carrier transport and device stability. Yet current SEM image analysis is still largely manual, limiting throughput and consistency. Here, we present an automated deep learning-based framework for SEM image segmentation that enables precise and efficient identification of lead iodide, perovskite and defect domains across diverse morphologies. Built upon an improved YOLOv8x architecture, our model named PerovSegNet incorporates two novel modules: (i) Adaptive Shuffle Dilated Convolution Block, which enhances multi-scale and fine-grained feature extraction through group convolutions and channel mixing; and (ii) Separable Adaptive Downsampling module, which jointly preserves fine-scale textures and large-scale structures for more robust boundary recognition. Trained on an augmented dataset of 10,994 SEM images, PerovSegNet achieves a mean Average Precision of 87.25% with 265.4 Giga Floating Point Operations, outperforming the baseline YOLOv8x-seg by 4.08%, while reducing model size and computational load by 24.43% and 25.22%, respectively. Beyond segmentation, the framework provides quantitative grain-level metrics, such as lead iodide/perovskite area and count, which can serve as reliable indicators of crystallization efficiency and microstructural quality. These capabilities establish PerovSegNet as a scalable tool for real-time process monitoring and data-driven optimization of perovskite thin-film fabrication. The source code is available at: https://github.com/wlyyj/PerovSegNet/tree/master.
Chinese Summary: This study proposes the PerovSegNet deep learning framework, which builds on an improved YOLOv8x architecture to automatically and precisely segment lead iodide, perovskite, and defect regions in SEM images of perovskite solar cells, providing an efficient tool for real-time monitoring and optimization of thin-film fabrication.
English Summary: This study introduces PerovSegNet, an automated deep learning framework based on enhanced YOLOv8x architecture that achieves precise segmentation of lead iodide, perovskite, and defect domains in SEM images, significantly improving analysis efficiency and accuracy for perovskite solar cell fabrication.

Authors:Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen
Title: OceanGym: A Benchmark Environment for Underwater Embodied Agents
Abstract:
We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, which make effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
Summary: OceanGym is the first comprehensive benchmark for underwater embodied AI agents, featuring realistic tasks and a unified MLLM-driven framework to tackle extreme challenges like low visibility and dynamic currents, aiming to bridge the gap between current AI and human expertise for real-world ocean exploration.

Authors:Alessio Masano, Matteo Pennisi, Federica Proietto Salanitri, Concetto Spampinato, Giovanni Bellitto
Title: Zero-Shot Decentralized Federated Learning
Abstract:
CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques like CoOp and CoCoOp enhance CLIP's adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning approaches, such as FedCoOp and FedTPG, improve performance but face generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. We propose Zero-shot Decentralized Federated Learning (ZeroDFL), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL employs an iterative prompt-sharing mechanism, allowing clients to optimize and exchange textual prompts to enhance generalization while drastically reducing communication overhead. We validate ZeroDFL on nine diverse image classification datasets, demonstrating that it consistently outperforms--or remains on par with--state-of-the-art federated prompt learning methods. More importantly, ZeroDFL achieves this performance in a fully decentralized setting while reducing communication overhead by 118x compared to FedTPG. These results highlight that our approach not only enhances generalization in federated zero-shot learning but also improves scalability, efficiency, and privacy preservation--paving the way for decentralized adaptation of large vision-language models in real-world applications.
Summary: ZeroDFL introduces a fully decentralized federated learning framework that enables zero-shot adaptation through iterative prompt sharing, outperforming or matching existing methods while reducing communication overhead by 118x and enhancing scalability and privacy.
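The iterative prompt-sharing mechanism can be pictured as clients alternating between local prompt optimization and a peer-to-peer exchange, with no central server. A toy sketch under assumed details (ring topology, simple averaging, a placeholder local objective); the paper's actual exchange rule may differ:

import numpy as np

rng = np.random.default_rng(0)

class Client:
    def __init__(self, dim=512, n_ctx=4):
        # Learnable soft-prompt tokens, CoOp-style.
        self.prompt = 0.02 * rng.normal(size=(n_ctx, dim))
        # Stand-in for this client's local data statistics / gradients.
        self.target = 0.02 * rng.normal(size=(n_ctx, dim))

    def local_update(self, steps=10, lr=0.1):
        for _ in range(steps):
            self.prompt += lr * (self.target - self.prompt)

def exchange_ring(clients):
    # Each client averages its prompt with its ring neighbor's: fully
    # decentralized, and only tiny prompt tensors travel between peers,
    # which is where the communication savings come from.
    snapshot = [c.prompt.copy() for c in clients]
    for i, c in enumerate(clients):
        c.prompt = 0.5 * (snapshot[i] + snapshot[(i + 1) % len(clients)])

clients = [Client() for _ in range(4)]
for _ in range(5):  # optimize-then-share rounds
    for c in clients:
        c.local_update()
    exchange_ring(clients)
print(clients[0].prompt.shape)  # (4, 512)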

Authors:Artur Barros, Carlos Caetano, João Macedo, Jefersson A. dos Santos, Sandra Avila
Title: Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification
Abstract:
Indoor scene classification is a critical task in computer vision, with wide-ranging applications from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene's components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at https://github.com/tutuzeraa/ASGRA.
Summary: The ASGRA framework uses scene graphs and graph attention networks to improve indoor scene classification and sensitive content analysis, achieving higher accuracy with inherent explainability and privacy protection.

Authors:Zhiwei Yang, Chen Gao, Mike Zheng Shou
Title: PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
Abstract:
Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handling any scene and any anomaly type without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training or manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
Summary: This paper introduces PANDA, a generalist video anomaly detection system that autonomously handles diverse scenarios and anomaly types without training data or human intervention through adaptive strategy planning, heuristic reasoning, tool-augmented reflection, and memory-driven self-improvement.

Authors:Junjie Zhou, Ze Liu, Lei Xiong, Jin-Ge Yao, Yueze Wang, Shitao Xiao, Fenfen Lin, Miguel Hu Chen, Zhicheng Dou, Siqi Bao, Defu Lian, Yongping Xiong, Zheng Liu
Title: MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
Abstract:
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic correspondence (e.g., object-text matching) while failing to assess the deeper reasoning required to capture complex relationships between visual and textual information. To address this gap, we introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval. MR$^2$-Bench presents the following critical values: 1) all tasks are reasoning-driven, going beyond shallow matching to effectively assess models' capacity for logical, spatial, and causal inference; 2) it features diverse multimodal data, such as natural images, diagrams, and visual puzzles, enabling comprehensive evaluation across content types; 3) it supports complex queries and documents containing multiple images and covers diverse retrieval scenarios, more accurately reflecting real-world applications. Our benchmark contains 1,309 curated queries, derived either from manual collection and annotation or from selective consolidation of public datasets. Despite achieving strong results on existing benchmarks, current state-of-the-art models still struggle on MR$^2$-Bench: for example, the leading Seed1.6-Embedding model attains a Recall@1 of 77.78 on MMEB, but only 9.91 on MR$^2$-Bench. This substantial performance gap highlights both the increased challenge posed by our benchmark and the pressing need for further advances in reasoning-intensive multimodal retrieval. The dataset and evaluation code will be made publicly available at https://github.com/VectorSpaceLab/MR2-Bench.
Summary: MR$^2$-Bench is a reasoning-intensive multimodal retrieval benchmark that assesses logical, spatial, and causal inference beyond surface-level matching, revealing significant performance gaps in state-of-the-art models and highlighting the need for further advances in reasoning-intensive retrieval.
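Recall@1, as reported here, is the standard retrieval metric; for reference, a minimal implementation over a query-by-document similarity matrix:

import numpy as np

def recall_at_k(scores, gold, k=1):
    # scores: (n_queries, n_docs) similarities; gold: correct doc index per query.
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float((topk == np.asarray(gold)[:, None]).any(axis=1).mean())

scores = np.array([[0.9, 0.2, 0.4],
                   [0.1, 0.3, 0.8]])
print(recall_at_k(scores, [0, 2], k=1))  # 1.0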

Authors:Harold Haodong Chen, Xianfeng Wu, Wen-Jie Shu, Rongjin Guo, Disen Lan, Harry Yang, Ying-Cong Chen
Title: Go with Your Gut: Scaling Confidence for Autoregressive Image Generation
Abstract:
Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
Summary: ScalingAR introduces the first test-time scaling framework for next-token-prediction autoregressive image generation, using token entropy to enable adaptive trajectory termination and dynamic guidance scheduling, achieving significant performance gains and efficiency improvements.
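The core signal, token entropy, is cheap to compute from the model's logits. A sketch of an entropy-driven confidence state with early termination; the entropy-to-confidence mapping and the EMA fusion below are illustrative assumptions, not the paper's exact Profile/Policy formulation:

import numpy as np

def token_entropy(logits):
    # Shannon entropy (nats) of the next-token distribution.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def confident_trajectory(entropies, ema=0.9, threshold=0.35):
    # Fuse per-token confidence into a running state; terminate the
    # trajectory as soon as the state drops below the threshold.
    state = 1.0
    for h in entropies:
        conf = 1.0 / (1.0 + h)       # illustrative entropy-to-confidence map
        state = ema * state + (1.0 - ema) * conf
        if state < threshold:
            return False
    return True

print(round(token_entropy(np.array([5.0, 1.0, 0.5, 0.2])), 3))  # ~0.19: confident token
print(confident_trajectory([0.2, 0.3, 0.4]))  # True: low-entropy tokens
print(confident_trajectory([3.5] * 40))       # False: persistently uncertain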

Authors:Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, Wenhu Chen
Title: EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Abstract:
Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models like GPT-Image-1, Seedream, and Google-Nano-Banana have shown highly promising progress. However, the open-source models are still lagging. The main bottleneck is the lack of a reliable reward model to scale up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new benchmark, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset. We train Step1X-Edit on the selected subset, which shows significant improvement over training on the full set. This demonstrates EditReward's ability to serve as a reward model to scale up high-quality training data for image editing. Furthermore, its strong alignment suggests potential for advanced applications like reinforcement learning-based post-training and test-time scaling of image editing models. EditReward and its training dataset will be released to help the community build more high-quality image editing training datasets.
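Reward models of this kind are typically fit on preference pairs with a Bradley-Terry-style objective; a minimal sketch of that generic formulation (not the released EditReward training code):

import numpy as np

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry negative log-likelihood that the annotator-preferred
    # edit outscores the rejected one: -log sigmoid(r_chosen - r_rejected).
    d = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-d))))

# Toy scores from a hypothetical reward head over (instruction, source, edit) triples.
print(preference_loss([1.2, 0.3], [0.4, 0.5]))  # ~0.58

The same scalar scores support the data-filtering use case described above: keep only samples whose reward exceeds a chosen threshold.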

Authors:Balamurugan Thambiraja, Malte Prinzler, Sadegh Aliakbarian, Darren Cosker, Justus Thies
Title: 3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation
Abstract:
Creating personalized 3D animations with precise control and realistic head motions remains challenging for current speech-driven 3D facial animation methods. Editing these animations is especially complex and time-consuming: it requires precise control and is typically handled by highly skilled animators. Most existing works focus on controlling the style or emotion of the synthesized animation and cannot edit or regenerate parts of an input animation. They also overlook the fact that multiple plausible lip and head movements can match the same audio input. To address these challenges, we present 3DiFACE, a novel method for holistic speech-driven 3D facial animation. Our approach produces diverse plausible lip and head motions for a single audio input and allows for editing via keyframing and interpolation. Specifically, we propose a fully-convolutional diffusion model that can leverage the viseme-level diversity in our training corpus. Additionally, we employ speaking-style personalization and a novel sparsely-guided motion diffusion to enable precise control and editing. Through quantitative and qualitative evaluations, we demonstrate that our method is capable of generating and editing diverse holistic 3D facial animations given a single audio input, with control over the trade-off between fidelity and diversity. Code and models are available here: https://balamuruganthambiraja.github.io/3DiFACE

Authors:Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
Title: IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Abstract:
Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
Summary: The Implicit Multimodal Guidance (IMG) framework improves alignment between generated images and prompts by identifying misalignments with a multimodal model and manipulating diffusion conditioning features for re-generation, outperforming existing methods without requiring additional data or editing operations.

Authors:Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong
Title: Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts
Abstract:
Generalized Category Discovery (GCD) is an open-world problem that clusters unlabeled data by leveraging knowledge from partially labeled categories. A key challenge is that unlabeled data may contain both known and novel categories. Existing approaches suffer from two main limitations. First, they fail to exploit multi-granularity conceptual information in visual data, which limits representation quality. Second, most assume that the number of unlabeled categories is known during training, which is impractical in real-world scenarios. To address these issues, we propose a Multi-Granularity Conceptual Experts (MGCE) framework that adaptively mines visual concepts and integrates multi-granularity knowledge for accurate category discovery. MGCE consists of two modules: (1) Dynamic Conceptual Contrastive Learning (DCCL), which alternates between concept mining and dual-level representation learning to jointly optimize feature learning and category discovery; and (2) Multi-Granularity Experts Collaborative Learning (MECL), which extends the single-expert paradigm by introducing additional experts at different granularities and by employing a concept alignment matrix for effective cross-expert collaboration. Importantly, MGCE can automatically estimate the number of categories in unlabeled data, making it suitable for practical open-world settings. Extensive experiments on nine fine-grained visual recognition benchmarks demonstrate that MGCE achieves state-of-the-art results, particularly in novel-class accuracy. Notably, even without prior knowledge of category numbers, MGCE outperforms parametric approaches that require knowing the exact number of categories, with an average improvement of 3.6%. Code is available at https://github.com/HaiyangZheng/MGCE.
Summary: The Multi-Granularity Conceptual Experts (MGCE) framework addresses limitations in Generalized Category Discovery by adaptively mining visual concepts and integrating multi-granularity knowledge, enabling automatic category-number estimation and achieving state-of-the-art performance across nine benchmarks.

Authors:Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, Jingyong Su
Title: Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation
Abstract:
Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while keeping encoding and decoding costs low. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.
Summary: GSDD introduces a sparse dataset distillation method that uses 2D Gaussians to encode critical image information efficiently, achieving state-of-the-art performance with minimal computational overhead across multiple benchmarks.
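The sparse representation is easy to picture: an image is approximated by a handful of anisotropic 2D Gaussians, each carrying a position, shape, and color. A dense numpy reference renderer as a sketch (the paper uses CUDA splatting operators; the parameter layout here is an assumption):

import numpy as np

def render_gaussians(H, W, gaussians):
    # gaussians: list of (cx, cy, sx, sy, rho, r, g, b, alpha) primitives.
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    img = np.zeros((H, W, 3))
    for cx, cy, sx, sy, rho, r, g, b, alpha in gaussians:
        dx, dy = (xs - cx) / sx, (ys - cy) / sy
        # Quadratic form of a correlated 2D Gaussian.
        q = (dx * dx - 2 * rho * dx * dy + dy * dy) / (1.0 - rho * rho)
        w = alpha * np.exp(-0.5 * q)
        img += w[..., None] * np.array([r, g, b])
    return np.clip(img, 0.0, 1.0)

# One elongated, tilted primitive on a 32x32 canvas.
img = render_gaussians(32, 32, [(16, 16, 6, 2, 0.4, 1.0, 0.6, 0.2, 0.9)])
print(img.shape, round(float(img.max()), 3))

Distillation then optimizes these few parameters per image rather than every pixel, which is where the storage savings come from.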

Authors:Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, Qingdong He, Donghao Luo, Zhenye Gan, Junwei Zhu, Yunhang Shen, Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan
Title: Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating human-based activities progressively from human-oriented granular perception to higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing an automated annotation pipeline and a human-annotation platform that support rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
Summary: This paper introduces Human-MME, a comprehensive benchmark designed to holistically evaluate Multimodal Large Language Models in human-centric scene understanding, addressing the lack of existing evaluation tools by providing diverse scenarios, progressive assessment dimensions, and high-quality annotations to guide future research.

Authors:Kyeongryeol Go
Title: Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis
Abstract:
The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at https://github.com/gokyeongryeol/ATES.
Summary: This paper introduces an automated pipeline that leverages a fine-tuned Large Language Model to generate diverse textual prompts, enabling a Text-to-Image model to synthesize challenging edge cases that improve deep neural network robustness, as validated on the FishEye8K benchmark.

Authors:Sachith Abeywickrama, Emadeldeen Eldele, Min Wu, Xiaoli Li, Chau Yuen
Title: EntroPE: Entropy-Guided Dynamic Patch Encoder for Time Series Forecasting
Abstract:
Transformer-based models have significantly advanced time series forecasting, with patch-based input strategies offering efficiency and improved long-horizon modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. In response, we propose EntroPE (Entropy-Guided Dynamic Patch Encoder), a novel, temporally informed framework that detects transition points via conditional entropy and places patch boundaries accordingly. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. These embeddings are then processed by a global transformer to model inter-patch dynamics. Experiments across long-term forecasting benchmarks demonstrate that EntroPE improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at: https://github.com/Sachithx/EntroPE.
Summary: The proposed EntroPE framework introduces entropy-guided dynamic patching to preserve temporal coherence in time series forecasting, overcoming the limitations of fixed patch segmentation while maintaining computational efficiency.
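The boundary-placement idea can be sketched with a simple histogram estimate of the conditional entropy H(x_t | x_{t-1}): discretize the series, estimate transition probabilities, and start a new patch wherever the entropy of the upcoming step spikes. This stand-in estimator and the mean-plus-z-std cut rule are assumptions for illustration, not the paper's EDP module:

import numpy as np

def entropy_boundaries(x, n_bins=8, z=1.0):
    # Discretize by quantiles, then build a Laplace-smoothed transition table.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(x, edges)                 # values in 0..n_bins-1
    trans = np.ones((n_bins, n_bins))
    for a, b in zip(states[:-1], states[1:]):
        trans[a, b] += 1
    cond = trans / trans.sum(axis=1, keepdims=True)
    ent = -(cond * np.log(cond)).sum(axis=1)       # H(next | current state)
    h = ent[states[:-1]]                           # entropy faced at each step
    return np.where(h > h.mean() + z * h.std())[0] + 1

rng = np.random.default_rng(0)
t = np.linspace(0, 6 * np.pi, 300)
x = np.sin(t) + 0.1 * rng.standard_normal(300)
print(entropy_boundaries(x)[:8])  # indices where new patches would start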

Authors:Shigui Li, Wei Chen, Delu Zeng
Title: EVODiff: Entropy-aware Variance Optimized Diffusion Inference
Abstract:
Diffusion models (DMs) excel in image generation, but suffer from slow inference and training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.
Summary: This paper introduces EVODiff, an entropy-aware variance-optimization method that enhances diffusion models by reducing conditional entropy during denoising, achieving superior efficiency and quality over existing gradient-based solvers in image generation tasks.

Authors:Ioana Ciuclea, Giorgio Longari, Alice Barbara Tumpach
Title: Geometric Learning of Canonical Parameterizations of $2D$-curves
Abstract:
Most datasets encountered in computer vision and medical applications present symmetries that should be taken into account in classification tasks. A typical example is the symmetry by rotation and/or scaling in object detection. A common way to build neural networks that learn the symmetries is to use data augmentation. In order to avoid data augmentation and build more sustainable algorithms, we present an alternative method to mod out symmetries based on the notion of a section of a principal fiber bundle. This framework allows the use of simple metrics on the space of objects in order to measure dissimilarities between orbits of objects under the symmetry group. Moreover, the section used can be optimized to maximize separation of classes. We illustrate this methodology on a dataset of contours of objects for the groups of translations, rotations, scalings and reparameterizations. In particular, we present a $2$-parameter family of canonical parameterizations of curves, containing the constant-speed parameterization as a special case, which we believe is interesting in its own right. We hope that this simple application will serve to convey the geometric concepts underlying this method, which have a wide range of possible applications. The code is available at https://github.com/GiLonga/Geometric-Learning. A tutorial notebook showcasing an application of the code to a specific dataset is available at https://github.com/ioanaciuclea/geometric-learning-notebook.
Summary: This paper introduces a geometric framework that uses sections of principal fiber bundles to mod out symmetries in datasets, offering a sustainable alternative to data augmentation for classification tasks and demonstrating the approach on object contours with a novel family of canonical parameterizations.
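The constant-speed special case of the paper's $2$-parameter family is easy to make concrete: resample a discrete 2D curve so that consecutive points are equally spaced in arc length. A numpy sketch of that special case only (the general family is in the linked code):

import numpy as np

def constant_speed(curve, n=100):
    # curve: (k, 2) polyline. Returns n points equally spaced in arc length.
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])    # cumulative arc length
    targets = np.linspace(0.0, s[-1], n)
    return np.stack([np.interp(targets, s, curve[:, 0]),
                     np.interp(targets, s, curve[:, 1])], axis=1)

# An unevenly sampled circle becomes evenly sampled after reparameterization.
t = np.linspace(0, 1, 50) ** 2 * 2 * np.pi
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
out = constant_speed(circle)
print(np.linalg.norm(np.diff(out, axis=0), axis=1).std())  # near 0: uniform spacing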

Authors:Yang Zhou, Kunhao Yuan, Ye Wei, Jishizhan Chen
Title: Multi-modal Liver Segmentation and Fibrosis Staging Using Real-world MRI Images
Abstract:
Liver fibrosis represents the accumulation of excessive extracellular matrix caused by sustained hepatic injury. It disrupts normal lobular architecture and function, increasing the chances of cirrhosis and liver failure. Precise staging of fibrosis for early diagnosis and intervention is often invasive, which carries risks and complications. To address this challenge, recent advances in artificial intelligence-based liver segmentation and fibrosis staging offer a non-invasive alternative. As a result, the CARE 2025 Challenge called for automated methods to quantify and analyse liver fibrosis in real-world scenarios, using multi-centre, multi-modal, and multi-phase MRI data. This challenge included tasks of precise liver segmentation (LiSeg) and fibrosis staging (LiFS). In this study, we developed an automated pipeline for both tasks across all the provided MRI modalities. This pipeline integrates pseudo-labelling based on multi-modal co-registration, liver segmentation using deep neural networks, and liver fibrosis staging based on shape, textural, appearance, and directional (STAD) features derived from segmentation masks and MRI images. By solely using the released data with limited annotations, our proposed pipeline demonstrated excellent generalisability for all MRI modalities, achieving top-tier performance across all competition subtasks. This approach provides a rapid and reproducible framework for quantitative MRI-based liver fibrosis assessment, supporting early diagnosis and clinical decision-making. Code is available at https://github.com/YangForever/care2025_liver_biodreamer.
Summary: Recent AI advances enable non-invasive liver fibrosis assessment through automated segmentation and staging using multi-modal MRI, offering a rapid, reproducible framework for early diagnosis and clinical decision-making.

Authors:Gagandeep Singh, Samudi Amarsinghe, Priyanka Singh, Xue Li
Title: DGM4+: Dataset Extension for Global Scene Inconsistency
Abstract:
The rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark DGM4+ that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
Summary: This study extends the DGM4 dataset with 5,000 samples featuring foreground-background mismatches and hybrid text manipulations to address global inconsistencies in multimodal disinformation, creating the DGM4+ benchmark for improved detector evaluation.
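Of the quality-control steps listed in the abstract, perceptual-hash deduplication is the most reusable. A minimal average-hash sketch over grayscale arrays, a simple stand-in for whatever hash the actual pipeline uses:

import numpy as np

def average_hash(img, size=8):
    # Block-average a grayscale image to size x size, threshold at the mean:
    # a 64-bit perceptual fingerprint that is stable under small edits.
    h, w = img.shape
    img = img[: h - h % size, : w - w % size]
    blocks = img.reshape(size, img.shape[0] // size,
                         size, img.shape[1] // size).mean(axis=(1, 3))
    bits = (blocks > blocks.mean()).ravel()
    return int("".join("1" if b else "0" for b in bits), 2)

def deduplicate(images, max_hamming=5):
    kept, seen = [], []
    for im in images:
        hsh = average_hash(im)
        if all(bin(hsh ^ s).count("1") > max_hamming for s in seen):
            kept.append(im)
            seen.append(hsh)
    return kept

rng = np.random.default_rng(0)
a = rng.random((64, 64))
near_dup = a + 0.001 * rng.random((64, 64))
print(len(deduplicate([a, near_dup, rng.random((64, 64))])))  # 2: near-dup dropped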

Authors:Gagandeep Singh, Samudi Amarsinghe, Urawee Thani, Ki Fung Wong, Priyanka Singh, Xue Li
Title: SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies
Abstract:
We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
Summary: This study enhances the HAMMER model with a segmentation-guided scoring (SGS) pipeline that detects global scene inconsistencies in manipulated images, improving detection accuracy without retraining while maintaining computational efficiency.
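The region-aware scoring can be sketched independently of HAMMER: embed the masked foreground, the background, and the caption with any joint vision-language encoder, take the weakest pairwise cosine coherence, and fuse it with the base manipulation score. The weights and fusion rule below are illustrative assumptions, not the paper's exact formula:

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def sgs_score(base_score, emb_fg, emb_bg, emb_text, w=0.5):
    # A plausible scene has fg, bg, and caption all mutually coherent;
    # the weakest link drives the penalty.
    coherence = min(cosine(emb_fg, emb_text),
                    cosine(emb_bg, emb_text),
                    cosine(emb_fg, emb_bg))
    return w * base_score + (1.0 - w) * (1.0 - coherence)

rng = np.random.default_rng(0)
fg = rng.normal(size=512)
txt = fg + 0.2 * rng.normal(size=512)     # caption matches the subject
bg_ok = fg + 0.2 * rng.normal(size=512)   # consistent backdrop
bg_odd = rng.normal(size=512)             # implausible backdrop
print(sgs_score(0.2, fg, bg_ok, txt))     # low: scene looks coherent
print(sgs_score(0.2, fg, bg_odd, txt))    # high: FG-BG mismatch flagged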

Authors:Christoph Timmermann, Hyunse Lee, Woojin Lee
Title: SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Abstract:
While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality, while keeping their semantic content intact through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time while overall outperforming other methods, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
Summary: SeMoBridge is a lightweight method that addresses CLIP's intra-modal misalignment by mapping images into the text modality while preserving semantics, achieving superior few-shot performance with minimal training time.
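A closed-form image-to-text-modality map can be illustrated with ridge regression on paired embeddings. Whether SeMoBridge's actual closed form matches this is an assumption; the sketch only shows the shape of the idea:

import numpy as np

def fit_bridge(img_emb, txt_emb, lam=1e-2):
    # W = argmin ||X W - Y||^2 + lam ||W||^2 = (X^T X + lam I)^-1 X^T Y
    d = img_emb.shape[1]
    return np.linalg.solve(img_emb.T @ img_emb + lam * np.eye(d),
                           img_emb.T @ txt_emb)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))   # few-shot image embeddings (16 shots, toy dim 64)
Y = rng.normal(size=(16, 64))   # their paired text embeddings
W = fit_bridge(X, Y)
mapped = X @ W                   # images expressed in the text modality
print(round(float(np.linalg.norm(mapped - Y) / np.linalg.norm(Y)), 3))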

Authors:Lubian Bai, Xiuyuan Zhang, Siqi Zhang, Zepeng Zhang, Haoyu Wang, Wei Qin, Shihong Du
Title: GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data
Abstract:
Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, the modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RS FMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance RS FM during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image mask-reconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM's adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence. Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration. Code, checkpoints, and usage examples are released at https://github.com/bailubin/GeoLink_NeurIPS2025
Summary: This study introduces GeoLink, a multimodal framework that integrates OpenStreetMap data with remote sensing foundation models during pretraining and downstream tasks to bridge modality gaps and enhance performance in geospatial intelligence applications.

Authors:Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester
Title: A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments
Abstract:
Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.
Summary: A novel computer vision framework using pose estimation and specialized modules effectively tracks salmon and their body parts in underwater environments, outperforming existing methods in crowded and turning scenarios while enabling automated welfare monitoring through tail beat analysis.
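Tail beat analysis from body-part tracks reduces to spectral analysis of a tail-angle time series; a minimal sketch of extracting the dominant beat frequency (tail beat wavelength then follows from swimming speed divided by frequency):

import numpy as np

def dominant_frequency(signal, fps):
    # Strongest nonzero frequency (Hz) in a tail-angle trace.
    sig = np.asarray(signal) - np.mean(signal)
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    return float(freqs[1:][np.argmax(spec[1:])])   # skip the DC bin

fps = 25.0
t = np.arange(0, 8, 1 / fps)
tail_angle = 0.3 * np.sin(2 * np.pi * 2.0 * t)     # synthetic 2 Hz beat
print(dominant_frequency(tail_angle, fps))          # ~2.0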

Authors:Yuan Zhao, Youwei Pang, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu, Xiaoqi Zhao
Title: UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Abstract:
Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general to specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
Summary: UniMMAD introduces a unified framework for multi-modal and multi-class anomaly detection, using a Mixture-of-Experts-driven feature decompression mechanism to enable adaptive, disentangled reconstruction and achieve state-of-the-art performance across diverse datasets while reducing parameters by 75%.
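The sparsely-gated selection at the heart of the decoder can be illustrated with standard top-k MoE gating: only the k experts with the largest gate logits run, and their outputs are blended by renormalized softmax weights. A generic sketch, not the released MoE-in-MoE module:

import numpy as np

def top_k_gate(logits, k=2):
    # Keep the k largest gate logits, softmax-renormalize, zero the rest.
    idx = np.argsort(-logits)[:k]
    z = logits[idx] - logits[idx].max()
    w = np.exp(z) / np.exp(z).sum()
    gates = np.zeros_like(logits, dtype=float)
    gates[idx] = w
    return gates

def moe_forward(x, experts, gate_logits, k=2):
    gates = top_k_gate(gate_logits, k)
    # Experts with zero gate never execute: sparse activation.
    return sum(g * f(x) for g, f in zip(gates, experts) if g > 0)

experts = [lambda v, s=s: s * v for s in (0.5, 1.0, 2.0, 4.0)]  # toy expert nets
print(moe_forward(np.ones(3), experts, np.array([0.1, 2.0, -1.0, 1.5])))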

Authors:Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu
Title: MuSLR: Multimodal Symbolic Logical Reasoning
Abstract:
Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
Summary: The MuSLR benchmark is introduced to evaluate multimodal symbolic logical reasoning in vision-language models, revealing their limitations and motivating LogiCAM, a modular framework that significantly enhances performance by applying formal logic rules to multimodal inputs.

Authors:Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
Title: More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Abstract:
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/

Authors:Yuan Gao, Sangwook Kim, Chris McIntosh
Title: EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks
Abstract:
Electrocardiogram (ECG) is a widely used tool for assessing cardiac function due to its low cost and accessibility. Emerging research shows that ECGs can help make predictions on key outcomes traditionally derived from more complex modalities such as echocardiograms (ECHO), enabling the use of ECGs as a more accessible method to predict broader measurements of cardiac function. ECHOs, in particular, are of great importance because they require considerable hospital resources while playing a key role in clinical cardiac assessment. To aid this use case, we introduce EchoingECG, a probabilistic student-teacher model that leverages uncertainty-aware ECG embeddings and ECHO supervision to improve ECG-based cardiac function prediction. Our approach integrates Probabilistic Cross-Modal Embeddings (PCME++), a probabilistic contrastive framework, with ECHO-CLIP, a vision-language pre-trained model trained on ECHO-text pairs, to distill ECHO knowledge into ECG representations. Through experiments and external validation, we showed that EchoingECG outperforms state-of-the-art foundation ECG models in zero-shot, few-shot, and fine-tune settings for ECHO predictions based on ECG. We also highlighted that variance estimation (enabled through our method) enhanced our understanding of model performance by identifying underlying regions of uncertainty within ECGs. The code is available: https://github.com/mcintoshML/EchoingECG.
Summary: The study introduces EchoingECG, a probabilistic student-teacher model that enhances ECG-based cardiac function prediction by distilling knowledge from echocardiograms through uncertainty-aware embeddings and cross-modal integration, outperforming existing methods across various settings.

Authors:Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao
Title: Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Abstract:
Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
Summary: This paper introduces Text Preference Optimization (TPO), a framework that improves text-to-image alignment by training models to prefer matched over mismatched prompts, eliminating the need for costly human-annotated image preference data while outperforming existing methods in human preference scores and alignment accuracy.

Authors:Xinyu Pu, Hongsong Wang, Jie Gui, Pan Zhou
Title: Dragging with Geometry: From Pixels to Geometry-Guided Image Editing
Abstract:
Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, most drag-based methods operate primarily on the 2D pixel plane with limited use of 3D cues. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method, GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. The code will be available at https://github.com/xinyu-pu/GeoDrag.
Summary: GeoDrag introduces a geometry-guided drag-based image editing method that integrates 3D geometric cues with 2D spatial priors through a unified displacement field, enabling precise, coherent, and structure-consistent edits in a single forward pass while resolving multi-point conflicts via region partitioning.

Authors:Shunpeng Chen, Changwei Wang, Rongtao Xu, Xingtian Pei, Yukun Song, Jinzhou Lin, Wenhao Xu, Jingyi Zhang, Li Guo, Shibiao Xu
Title: SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED using only 4096-D global descriptors. Code and model will be available at: https://github.com/chenshunpeng/SAGE.
Summary: SAGE introduces a unified training pipeline that enhances visual place recognition by dynamically integrating spatial context and visual similarity through adaptive graph exploration and hard sample mining, achieving state-of-the-art results across multiple benchmarks with high recall rates.

Authors:Tingyu Shi, Fan Lyu, Shaoliang Peng
Title: Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction
Abstract:
Active Test-Time Adaptation (ATTA) improves model robustness under domain shift by selectively querying human annotations at deployment, but existing methods use heuristic uncertainty measures and suffer from low data selection efficiency, wasting human annotation budget. We propose Conformal Prediction Active TTA (CPATTA), which first brings principled, coverage-guaranteed uncertainty into ATTA. CPATTA employs smoothed conformal scores with a top-K certainty measure, an online weight-update algorithm driven by pseudo coverage, a domain-shift detector that adapts human supervision, and a staged update scheme that balances human-labeled and model-labeled data. Extensive experiments demonstrate that CPATTA consistently outperforms the state-of-the-art ATTA methods by around 5% in accuracy. Our code and datasets are available at https://github.com/tingyushi/CPATTA.
中文摘要:CPATTA采用基于保形预测的理论框架改进主动测试时适应方法,通过优化的不确定性度量与自适应标注策略,在多项实验中比现有最优方法准确率提升约5%。
English Summary: CPATTA introduces a principled conformal prediction framework to enhance active test-time adaptation, achieving approximately 5% higher accuracy than existing methods through improved uncertainty measurement and adaptive human annotation strategies.
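
For reference, the split-conformal machinery behind coverage-guaranteed uncertainty is compact. A minimal sketch using the plain 1 - p(true class) score (CPATTA itself uses smoothed conformal scores with a top-K certainty measure and online weight updates, which this omits):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal prediction: nonconformity = 1 - p(true class) on a
    held-out calibration set; returns the quantile with the standard
    finite-sample correction (needs n large enough for the level)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, q):
    # Keep every class whose nonconformity falls below the threshold; the
    # resulting sets cover the true label with probability >= 1 - alpha.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```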

Authors:Kaiyu Li, Zixuan Jiang, Xiangyong Cao, Jiayu Wang, Yuchen Xiao, Deyu Meng, Zhi Wang
Title: DescribeEarth: Describe Anything for Remote Sensing Images
Abstract:
Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted question-answering-based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code and weights are released at https://github.com/earth-insights/DescribeEarth.
中文摘要:本研究提出了Geo-DLC这一针对遥感图像的细粒度目标级描述新任务,构建了DE-Dataset数据集并建立了DE-Benchmark评估体系,所设计的DescribeEarth模型在准确性、丰富性和语法规范性上均优于现有先进多模态大语言模型。
English Summary: The study introduces Geo-DLC, a novel object-level fine-grained captioning task for remote sensing images, supported by the DE-Dataset and evaluated through the DE-Benchmark, with the proposed DescribeEarth model demonstrating superior performance in accuracy and detail over existing methods.

Authors:Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao
Title: Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Abstract:
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
中文摘要:Vision-Zero 是一种领域无关的框架,通过任意图像对生成竞争性视觉游戏,使视觉语言模型能够实现自我优化,无需人工标注即可在多项推理任务中达到最先进性能。
English Summary: Vision-Zero is a domain-agnostic framework that enables vision-language models to self-improve through competitive visual games generated from arbitrary image pairs, eliminating the need for manual annotation while achieving state-of-the-art performance across multiple reasoning tasks.

Authors:Yuyou Zhang, Radu Corcodel, Chiori Hori, Anoop Cherian, Ding Zhao
Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Abstract:
We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that single-object simpler tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space. Our website can be found at https://spinbench25.github.io/.

Authors:Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
Title: VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Abstract:
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload

Authors:Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan
Title: InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
Abstract:
In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
Chinese: InfMasking提出了一种无限掩码策略,通过在融合过程中随机遮蔽模态特征并利用互信息最大化对齐掩码表示,有效增强了模态间的协同信息,在七个基准测试中取得了最优性能。
English: InfMasking introduces an infinite masking strategy in multimodal learning that stochastically occludes modality features during fusion and aligns masked representations through mutual information maximization, achieving state-of-the-art performance across seven benchmarks by enhancing synergistic interactions.
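
A compact sketch of the mask-and-align step, assuming per-feature Bernoulli masks and a standard InfoNCE surrogate for the mutual-information term; the paper derives a dedicated InfMasking loss to approximate the infinite-masking limit, so treat this single-mask variant as illustrative:

```python
import torch
import torch.nn.functional as F

def masked_fusion_loss(feat_a, feat_b, fuse, mask_ratio=0.8, tau=0.07):
    """Sketch: occlude most features of each modality before fusion, then
    pull the masked fused view toward the unmasked one with InfoNCE.
    `fuse` is any fusion network mapping (B, D) x (B, D) -> (B, D')."""
    keep_a = (torch.rand_like(feat_a) > mask_ratio).float()
    keep_b = (torch.rand_like(feat_b) > mask_ratio).float()
    full = fuse(feat_a, feat_b)                    # unmasked fused view
    part = fuse(feat_a * keep_a, feat_b * keep_b)  # masked fused view
    z1, z2 = F.normalize(full, dim=-1), F.normalize(part, dim=-1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity grid
    targets = torch.arange(z1.size(0), device=z1.device)
    # InfoNCE is a lower bound on the mutual information between views;
    # resampling masks each step exposes diverse partial combinations.
    return F.cross_entropy(logits, targets)
```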

Authors:Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang
Title: VGGT-X: When VGGT Meets Dense Novel View Synthesis
Abstract:
We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/
Chinese: 本研究通过引入VGGT-X系统,解决了三维基础模型在密集新视角合成中的内存限制和输出缺陷问题,实现了不依赖传统COLMAP流程的最先进性能。
English: This research addresses the challenges of scaling 3D Foundation Models for dense Novel View Synthesis by introducing VGGT-X, which overcomes memory limitations and output imperfections to achieve state-of-the-art results without relying on traditional COLMAP pipelines.

Authors:Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu
Title: Visual Jigsaw Post-Training Improves MLLMs
Abstract:
Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/
中文摘要:本文提出Visual Jigsaw这一自监督强化学习框架,通过让多模态大语言模型用自然语言重构被打乱的视觉序列来增强其视觉理解能力,在多种视觉模态上显著提升了细粒度感知、时序推理和空间理解能力。
English Summary: This paper introduces Visual Jigsaw, a self-supervised reinforcement learning framework that enhances multimodal language models' visual understanding by having them reconstruct shuffled visual sequences through natural language permutations, demonstrating significant improvements in perception and reasoning across multiple visual modalities.
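
The RLVR fit comes from the reward being mechanically checkable. A sketch of such a verifiable ordering reward (the answer format and all-or-nothing scoring are our assumptions, not confirmed details of the paper):

```python
def jigsaw_reward(response: str, true_perm: list[int]) -> float:
    """Verifiable reward for a jigsaw ordering task: the model answers with
    a permutation such as "2 0 3 1"; exact match earns 1.0, anything
    malformed or wrong earns 0.0. A partial-credit variant is also easy."""
    try:
        pred = [int(tok) for tok in response.strip().replace(",", " ").split()]
    except ValueError:
        return 0.0                     # unparseable answer
    if sorted(pred) != sorted(true_perm):
        return 0.0                     # not a permutation of the pieces
    return 1.0 if pred == true_perm else 0.0
```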

Authors:Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan
Title: FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation
Abstract:
In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Github page: https://pku-yuangroup.github.io/FlashI2V/
中文: FlashI2V通过潜在偏移和傅里叶引导解决了图像到视频生成中的条件图像泄漏问题,仅用13亿参数就在域外数据上实现了最优的泛化能力和性能表现。
English: FlashI2V addresses conditional image leakage in Image-to-Video generation by introducing Latent Shifting and Fourier Guidance, achieving superior generalization and performance on out-of-domain data with only 1.3B parameters.
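
A hedged sketch of one plausible reading of the two components: fold the condition into the flow-matching source by subtraction rather than concatenation, and extract a high-frequency Fourier magnitude as guidance. Shapes, scalings, and the cutoff are illustrative only:

```python
import torch

def shifted_flow_matching_pair(video_latent, cond_latent, t):
    """Sketch of latent shifting: the first-frame condition is subtracted
    from the noise so it enters the path only implicitly; exact scaling
    and schedules in the paper may differ from this plain subtraction."""
    noise = torch.randn_like(video_latent)
    x0 = noise - cond_latent          # shifted source distribution
    x1 = video_latent                 # target distribution
    t = t.view(-1, 1, 1, 1, 1)        # (B,) -> broadcast over (B,C,T,H,W)
    xt = (1 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                # velocity the model regresses
    return xt, v_target

def highfreq_magnitude(cond_latent, cutoff=0.25):
    # Fourier guidance feature: keep only high-frequency magnitudes.
    f = torch.fft.fftshift(torch.fft.fft2(cond_latent), dim=(-2, -1))
    h, w = cond_latent.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=cond_latent.device),
        torch.linspace(-1, 1, w, device=cond_latent.device), indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() > cutoff).to(f.real.dtype)
    return (f * mask).abs()
```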

Authors:Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang
Title: PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
Abstract:
Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
中文: PixelCraft是一种创新的多智能体系统,通过结合高保真像素级定位与计算机视觉算法,并采用动态工作流程实现灵活自适应的推理,显著提升了结构化图像上的视觉推理性能。
English: PixelCraft is a novel multi-agent system that enhances visual reasoning on structured images by integrating high-fidelity pixel-level localization with computer vision algorithms and employing a dynamic workflow for flexible, adaptive reasoning, significantly outperforming existing methods on complex tasks.

Authors:Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai
Title: DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
Abstract:
We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
中文: DC-VideoGen是一种后训练加速框架,通过轻量级微调将预训练模型适配到深度压缩的潜空间,显著提升视频生成效率,推理延迟降低高达14.8倍且不损失质量。
English: DC-VideoGen is a post-training acceleration framework that enhances video generation efficiency by adapting pre-trained models to a compressed latent space through lightweight fine-tuning, achieving up to 14.8x faster inference without quality loss.

Authors:Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai
Title: DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space
Abstract:
Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.
中文: DC-Gen通过轻量级嵌入对齐和少量LoRA微调,利用深度压缩的潜在空间加速文本到图像扩散模型,在保持与基础模型相当图像质量的同时实现显著加速。
English: DC-Gen accelerates text-to-image diffusion models by leveraging a deeply compressed latent space through lightweight embedding alignment and minimal LoRA fine-tuning, achieving significant speedups while maintaining image quality comparable to base models.

Authors:Bingkui Tong, Jiaer Xia, Kaiyang Zhou
Title: Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding
Abstract:
Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations -- generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate as they only capture biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms current state-of-the-art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD.
Chinese: 多模态大语言模型常出现幻觉问题,而提出的层对比解码方法通过对比浅层与深层视觉特征,有效减少幻觉,显著提升了输出的准确性。
English: Multimodal Large Language Models often produce hallucinations, but the proposed Layer Contrastive Decoding method effectively mitigates this by contrasting shallow and deep visual features to enhance output accuracy.
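
The contrast itself fits in a few lines. A sketch following the generic contrastive-decoding recipe, with two forward passes that differ only in which vision-encoder layer supplies the visual features (`alpha`, `beta`, and the plausibility mask are conventional contrastive-decoding choices, not confirmed LayerCD details):

```python
import torch

@torch.no_grad()
def layer_contrastive_logits(logits_deep, logits_shallow, alpha=1.0, beta=0.1):
    """Sketch of contrastive decoding across vision-encoder depths: amplify
    what the deep-feature pass supports and the shallow-feature pass does
    not, restricted to a plausibility set from the deep distribution."""
    contrast = (1 + alpha) * logits_deep - alpha * logits_shallow
    # Adaptive plausibility constraint: only tokens whose deep-pass
    # probability is within a factor beta of the top token stay eligible.
    probs = logits_deep.softmax(dim=-1)
    keep = probs >= beta * probs.max(dim=-1, keepdim=True).values
    return contrast.masked_fill(~keep, float("-inf"))
```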

Authors:Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou
Title: Personalized Vision via Visual In-Context Learning
Abstract:
Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.
中文: PICO提出了一种四面板视觉上下文学习框架,利用扩散变换器实现无需重新训练的用户自定义任务个性化适配,在识别和生成任务中均优于现有方法。
English: PICO introduces a four-panel visual in-context learning framework that leverages diffusion transformers to enable flexible personalization for user-defined tasks without retraining, outperforming existing methods in both recognition and generation tasks.

Authors:Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu
Title: Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Abstract:
Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades generated video streams over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
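
A toy sketch of the rolling-window scheme as described: each window slot carries a fixed noise level that grows toward the newest frame, the whole window is denoised jointly, clean frames exit at the front, and fresh noise enters at the back. `model`, the level schedule, and shapes are placeholders, not the paper's implementation:

```python
import torch

def rolling_denoise(model, window, num_levels):
    """Sketch of joint denoising over a rolling window. `window` is
    (B, T, C, H, W); slot 0 is nearly clean and slot T-1 is nearly pure
    noise. Levels stay attached to slots while content rolls through."""
    T = window.shape[1]
    levels = torch.linspace(0, num_levels - 1, T).long()   # per-slot noise
    while True:
        window = model(window, levels)      # one joint denoising step
        yield window[:, 0]                  # emit the front (cleanest) frame
        fresh = torch.randn_like(window[:, :1])
        window = torch.cat([window[:, 1:], fresh], dim=1)  # roll the window
```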

Authors:Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
Abstract:
Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.
中文: GSM8K-V作为一个纯视觉多图像数学推理基准被提出,旨在弥补现有基准的不足,揭示了当前视觉语言模型虽然在文本数学推理上表现优异,但在视觉数学推理方面仍有巨大提升空间。
English: GSM8K-V is introduced as a purely visual multi-image mathematical reasoning benchmark to address gaps in existing benchmarks, revealing significant performance disparities between text-based and visual mathematical reasoning in current vision language models despite their advanced capabilities.

Authors:Zhaozhi Wang, Tong Zhang, Mingyue Guo, Yaowei Wang, Qixiang Ye
Title: VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains -- e.g., 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B -- while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding. Our code will be made publicly available at https://github.com/feufhd/VideoAnchor.
中文: 多模态大语言模型因语言标记主导注意力而存在视觉空间推理局限,但提出的VideoAnchor模块利用子空间聚类原理增强跨帧视觉线索且无需重新训练,在空间任务上实现了显著性能提升。
English: Multimodal Large Language Models struggle with visual-spatial reasoning due to language tokens dominating attention, but the proposed VideoAnchor module leverages subspace clustering principles to reinforce visual cues across frames without retraining, achieving significant performance improvements on spatial tasks.
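
The self-expressiveness connection can be made concrete with a ridge-regularized self-representation over visual tokens, whose coefficient magnitudes act as a cross-frame affinity to reinforce attention. The closed form and symmetrization below are our illustration, not the paper's exact estimator:

```python
import torch

def subspace_affinity(tokens, lam=0.1):
    """Sketch of self-expressiveness: represent each visual token as a
    combination of the others, X ~= C X, solved in closed form with ridge
    regularization; |C| then serves as a subspace affinity that could be
    added to attention scores to anchor recurring visual cues."""
    X = tokens                          # (N, D) visual tokens
    G = X @ X.t()                       # (N, N) Gram matrix
    N = G.size(0)
    # Minimizing ||X - C X||^2 + lam ||C||^2 gives C = (G + lam I)^-1 G.
    C = torch.linalg.solve(G + lam * torch.eye(N, device=X.device), G)
    C = C - torch.diag(torch.diag(C))   # forbid trivial self-representation
    A = (C.abs() + C.abs().t()) / 2     # symmetrize into an affinity
    return A / A.max().clamp_min(1e-8)
```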

Authors:Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue, Kota Yamaguchi
Title: LayerD: Decomposing Raster Graphic Designs into Layers
Abstract:
Designers craft and edit graphic designs in a layer representation, but layer-based editing becomes impossible once composited into a raster image. In this work, we propose LayerD, a method to decompose raster graphic designs into layers for re-editable creative workflow. LayerD addresses the decomposition task by iteratively extracting unoccluded foreground layers. We propose a simple yet effective refinement approach taking advantage of the assumption that layers often exhibit uniform appearance in graphic designs. As decomposition is ill-posed and the ground-truth layer structure may not be reliable, we develop a quality metric that addresses the difficulty. In experiments, we show that LayerD successfully achieves high-quality decomposition and outperforms baselines. We also demonstrate the use of LayerD with state-of-the-art image generators and layer-based editing.
中文: LayerD是一种创新方法,通过迭代提取未遮挡前景层并基于图层均匀外观假设进行优化,将栅格图形设计分解为可编辑图层,在实验中优于基线方法,并能与先进图像生成器协同实现可重复编辑的工作流程。
English: LayerD is a novel method that decomposes raster graphic designs into editable layers by iteratively extracting unoccluded foregrounds and refining them based on uniform appearance assumptions, outperforming baselines and enabling re-editable workflows with modern image generators.

Authors:Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia
Title: MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Abstract:
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text-speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omnimodal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omnimodal understanding and controllable, personalized long-horizon speech generation.
Chinese: MGM-Omni 是一种统一的Omni LLM,通过双轨架构高效处理多模态理解和富有表现力的长时程语音生成,实现了低延迟流式处理,并在数据高效训练的基础上,在各项任务中展现出卓越性能。
English: MGM-Omni is a unified Omni LLM that efficiently handles multimodal understanding and expressive, long-horizon speech generation through a dual-track architecture, enabling low-latency streaming and superior performance across diverse tasks with data-efficient training.

Authors:Jan Held, Renaud Vandeghen, Sanghyun Son, Daniel Rebain, Matheus Gadelha, Yi Zhou, Ming C. Lin, Marc Van Droogenbroeck, Andrea Tagliasacchi
Title: Triangle Splatting+: Differentiable Rendering with Opaque Triangles
Abstract:
Reconstructing 3D scenes and synthesizing novel views has seen rapid progress in recent years. Neural Radiance Fields demonstrated that continuous volumetric radiance fields can achieve high-quality image synthesis, but their long training and rendering times limit practicality. 3D Gaussian Splatting (3DGS) addressed these issues by representing scenes with millions of Gaussians, enabling real-time rendering and fast optimization. However, Gaussian primitives are not natively compatible with the mesh-based pipelines used in VR headsets and real-time graphics applications. Existing solutions attempt to convert Gaussians into meshes through post-processing or two-stage pipelines, which increases complexity and degrades visual quality. In this work, we introduce Triangle Splatting+, which directly optimizes triangles, the fundamental primitive of computer graphics, within a differentiable splatting framework. We formulate triangle parametrization to enable connectivity through shared vertices, and we design a training strategy that enforces opaque triangles. The final output is immediately usable in standard graphics engines without post-processing. Experiments on the Mip-NeRF360 and Tanks & Temples datasets show that Triangle Splatting+ achieves state-of-the-art performance in mesh-based novel view synthesis. Our method surpasses prior splatting approaches in visual fidelity while remaining efficient and fast to train. Moreover, the resulting semi-connected meshes support downstream applications such as physics-based simulation or interactive walkthroughs. The project page is https://trianglesplatting2.github.io/trianglesplatting2/.

Authors:AmirHossein Zamani, Bruno Roy, Arianna Rampini
Title: Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
Abstract:
Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
中文: 近期3D生成模型依赖手动UV映射这一耗时环节,我们提出的无监督框架通过融入语义感知和可见性感知机制,自动生成优化后的UV图谱,显著提升纹理生成质量并减少接缝瑕疵。
English: Recent 3D generative models require manual UV mapping, a bottleneck in asset creation, but our unsupervised framework automates this process by incorporating semantic and visibility awareness to produce superior UV atlases that enhance texture generation and minimize seam visibility.
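
A sketch of what a soft, differentiable AO-weighted seam objective could look like, assuming a per-edge seam probability and an AO-derived exposure value in [0, 1] with 1 meaning fully visible; the paper's precise objective and AO convention may differ:

```python
import torch

def ao_seam_loss(seam_prob, exposure):
    """Sketch of a visibility-aware seam objective. `seam_prob` is a soft,
    differentiable per-edge probability of cutting there; `exposure` is an
    ambient-occlusion-derived proxy in [0, 1] (1 = fully visible). Cutting
    on exposed edges is penalized, steering seams into occluded regions."""
    return (seam_prob * exposure).sum() / seam_prob.sum().clamp_min(1e-8)
```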

Authors:Dingning Liu, Haoyu Guo, Jingyi Zhou, Tong He
Title: BRIDGE -- Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation
Abstract:
Monocular Depth Estimation (MDE) is a foundational task for computer vision. Traditional methods are limited by data scarcity and quality, hindering their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth, from diverse source depth maps. Then we train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This innovative data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches quantitatively and in complex scene detail capture, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.

Authors:Huaizhi Qu, Xiao Wang, Gengwei Zhang, Jie Peng, Tianlong Chen
Title: GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction
Abstract:
Cryo-electron microscopy (cryo-EM) has become a central tool for high-resolution structural biology, yet the massive scale of datasets (often exceeding 100k particle images) renders 3D reconstruction both computationally expensive and memory intensive. Traditional Fourier-space methods are efficient but lose fidelity due to repeated transforms, while recent real-space approaches based on neural radiance fields (NeRFs) improve accuracy but incur cubic memory and computation overhead. Therefore, we introduce GEM, a novel cryo-EM reconstruction framework built on 3D Gaussian Splatting (3DGS) that operates directly in real space while maintaining high efficiency. Instead of modeling the entire density volume, GEM represents proteins with compact 3D Gaussians, each parameterized by only 11 values. To further improve training efficiency, we designed a novel gradient computation restricted to the 3D Gaussians that contribute to each voxel. This design substantially reduces both memory footprint and training cost. On standard cryo-EM benchmarks, GEM achieves up to 48% faster training and 12% lower memory usage compared to state-of-the-art methods, while improving local resolution by as much as 38.8%. These results establish GEM as a practical and scalable paradigm for cryo-EM reconstruction, unifying speed, efficiency, and high-resolution accuracy. Our code is available at https://github.com/UNITES-Lab/GEM.
中文: GEM提出了一种基于3D高斯点渲染的新型冷冻电镜重建框架,相比现有方法显著降低了内存占用和训练成本,同时提高了分辨率。
English: GEM introduces a novel cryo-EM reconstruction framework using 3D Gaussian Splatting, which significantly reduces memory usage and training costs while improving resolution compared to existing methods.
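
One plausible split of the 11 values is 3 (mean) + 3 (log-scale) + 4 (rotation quaternion) + 1 (amplitude); the abstract does not spell this out, so treat the split as our assumption. A sketch evaluating one such Gaussian at voxel coordinates:

```python
import torch

def gaussian_density(params, coords):
    """Evaluate an 11-parameter 3D Gaussian at voxel coordinates `coords`
    of shape (V, 3). Parameter layout is our assumed split, see above."""
    mean, log_scale, quat, amp = params[:3], params[3:6], params[6:10], params[10]
    w, x, y, z = quat / quat.norm()
    R = torch.stack([                     # quaternion -> rotation matrix
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    S = torch.diag(torch.exp(log_scale))  # covariance = R S S^T R^T
    cov_inv = torch.linalg.inv(R @ S @ S.t() @ R.t())
    d = coords - mean                     # (V, 3) offsets from the mean
    m = (d @ cov_inv * d).sum(-1)         # squared Mahalanobis distance
    return amp * torch.exp(-0.5 * m)
```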

Authors:Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Title: VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Abstract:
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
Chinese: VT-FSL框架通过跨模态迭代提示和几何对齐,利用大语言模型生成精确的类别描述和合成图像,以增强小样本学习,在多个基准测试中取得了最先进的性能。
English: The VT-FSL framework introduces cross-modal iterative prompting and geometric alignment to enhance few-shot learning by generating precise class descriptions and synthetic images using LLMs, achieving state-of-the-art results across multiple benchmarks.
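
The geometric term is concrete enough to sketch: the squared volume of the parallelotope spanned by three vectors equals the determinant of their Gram matrix, so shrinking that volume pulls the textual, support, and synthetic representations together. The paper uses a kernelized variant; this plain version is ours:

```python
import torch
import torch.nn.functional as F

def parallelotope_volume_loss(text_feat, support_feat, synth_feat):
    """Sketch of geometric alignment via a Gram determinant: for unit
    vectors, volume -> 0 as the three representations become collinear,
    so minimizing it enforces consistent multimodal integration."""
    V = torch.stack([F.normalize(text_feat, dim=-1),
                     F.normalize(support_feat, dim=-1),
                     F.normalize(synth_feat, dim=-1)])   # (3, D)
    gram = V @ V.t()                                     # (3, 3)
    return torch.det(gram).clamp_min(1e-8).sqrt()        # spanned volume
```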

Authors:Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, Feng Zhao
Title: STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
Abstract:
Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting. Similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward. An entropy-based reward, computed against the reference model, to stabilize learning. By alleviating conflicts between tokens and stabilizing training with the entropy reward, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
中文: STAGE框架通过解决自回归图像生成中令牌梯度冲突和稳定熵动态,有效提升了强化学习的训练稳定性,在多个基准测试中实现了更好的图像质量和泛化能力。
English: The STAGE framework addresses instability in reinforcement learning for autoregressive image generation by resolving contradictory token gradients and stabilizing entropy dynamics, leading to improved image quality and generalization across benchmarks.

Authors:Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan
Title: GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Abstract:
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.
中文摘要:本研究提出了一种新颖的后训练框架,通过引入任务感知奖励来增强强化学习模型在地球观测任务中的表现,在多个基准测试中显著提升了推理能力、优化稳定性和鲁棒性。
English Summary: This study introduces a novel post-training framework that enhances reinforcement learning models for Earth Observation tasks by incorporating task-aware rewards, improving reasoning capabilities, stability, and robustness across multiple benchmarks.

Authors:Tooba Imtiaz, Lucy Chai, Kathryn Heal, Xuan Luo, Jungyeon Park, Jennifer Dy, John Flynn
Title: LVT: Large-Scale Scene Reconstruction via Local View Transformers
Abstract:
Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer's well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes both color and opacity view-dependence. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass. See our project page for results and interactive demos https://toobaimt.github.io/lvt/.
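
A sketch of the relative geometric conditioning, assuming 4x4 world-from-camera matrices: restrict attention to spatially nearby views and encode each neighbor's transform into the query camera's frame with a small MLP (`mlp` and the flattening are placeholders, not the paper's exact encoding):

```python
import torch

def local_view_pose_encoding(pose_q, poses_n, mlp):
    """Sketch of LVT-style conditioning: tokens from each neighboring view
    receive an embedding of the transform taking that neighbor's camera
    into the query camera's frame, so attention stays local and
    geometry-aware without quadratic all-view attention."""
    # pose_q: (4, 4) query world-from-camera; poses_n: (K, 4, 4) neighbors.
    rel = torch.linalg.inv(pose_q) @ poses_n    # (K, 4, 4) relative poses
    return mlp(rel.reshape(-1, 16))             # (K, E) per-view encodings
```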

Authors:Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang, Eric Li, Yikai Wang, Xiaojie Jin, Yao Zhao, Yunchao Wei
Title: PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion
Abstract:
Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
中文: PanoWorld-X是一种创新框架,通过球面感知扩散变换器和大型数据集生成高保真、可控的全景视频,克服了以往视场角限制和相机控制不足的问题。
English: PanoWorld-X is a novel framework that generates high-fidelity and controllable panoramic videos with diverse camera trajectories, overcoming previous limitations in field-of-view and camera control through a Sphere-Aware Diffusion Transformer and a large-scale dataset.
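
The sphere-aware reprojection builds on the standard equirectangular-to-sphere mapping, which is simple to write down (this helper is ours; the paper constructs latent-space adjacency on top of such directions):

```python
import torch

def equirect_to_sphere(h, w):
    """Map each equirectangular pixel center to a unit direction on the
    sphere, so neighborhoods can follow true spherical adjacency rather
    than image-plane adjacency (which distorts heavily near the poles)."""
    v = (torch.arange(h) + 0.5) / h            # rows -> latitude fraction
    u = (torch.arange(w) + 0.5) / w            # cols -> longitude fraction
    lat = (0.5 - v) * torch.pi                 # +pi/2 at top, -pi/2 at bottom
    lon = (u - 0.5) * 2 * torch.pi             # -pi .. pi
    lat, lon = torch.meshgrid(lat, lon, indexing="ij")
    return torch.stack([lat.cos() * lon.sin(), # (H, W, 3) unit vectors
                        lat.sin(),
                        lat.cos() * lon.cos()], dim=-1)
```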

Authors:Haotian Dong, Wenjing Wang, Chen Li, Di Lin
Title: Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel
Abstract:
RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose Wan-Alpha, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.

Authors:Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker
Title: Segmentor-Guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis
Abstract:
Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient's age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.
中文: 现有的反事实图像生成方法在处理结构特异性干预时效果不足且依赖繁琐的像素级标注,而本文提出的Seg-CFT方法通过简单标量变量即可生成局部一致的反事实图像,在医学影像领域展现出良好应用前景。
English: Current counterfactual image generation methods are insufficient for structure-specific interventions and require tedious pixel-level guidance, but the proposed Seg-CFT approach effectively produces locally coherent counterfactuals using simple scalar variables while demonstrating promising results in medical imaging.

Authors:Yu Ma, Guoliang Wei, Haihong Xiao, Yue Cheng
Title: HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping
Abstract:
Novel View Synthesis (NVS) from sparse views presents a formidable challenge in 3D reconstruction, where limited multi-view constraints lead to severe overfitting, geometric distortion, and fragmented scenes. While 3D Gaussian Splatting (3DGS) delivers real-time, high-fidelity rendering, its performance drastically deteriorates under sparse inputs, plagued by floating artifacts and structural failures. To address these challenges, we introduce HBSplat, a unified framework that elevates 3DGS by seamlessly integrating robust structural cues, virtual view constraints, and occluded region completion. Our core contributions are threefold: a Hybrid-Loss Depth Estimation module that ensures multi-view consistency by leveraging dense matching priors and integrating reprojection, point propagation, and smoothness constraints; a Bidirectional Warping Virtual View Synthesis method that enforces substantially stronger constraints by creating high-fidelity virtual views through bidirectional depth-image warping and multi-view fusion; and an Occlusion-Aware Reconstruction component that recovers occluded areas using a depth-difference mask and a learning-based inpainting model. Extensive evaluations on LLFF, Blender, and DTU benchmarks validate that HBSplat sets a new state-of-the-art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while maintaining real-time inference. Code is available at: https://github.com/eternalland/HBSplat.
Chinese: HBSplat通过整合结构线索、虚拟视图合成和遮挡感知重建,提升了3D高斯溅射在稀疏视图下的表现,实现了最先进的实时渲染效果。
English: HBSplat enhances 3D Gaussian Splatting by integrating structural cues, virtual view synthesis, and occlusion-aware reconstruction to overcome sparse view challenges, achieving state-of-the-art performance and real-time rendering.

Authors:Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, Giuseppe Loianno
Title: ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation
Abstract:
Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets--DJI-day, Bosonplus-day, and Bosonplus-night--captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: http://xjh19971.github.io/ThermalGen

Authors:Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu
Title: UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
Abstract:
We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA) that selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and a nearly constant memory footprint as the number of observations grows. We also introduce a perceiver-based multi-reference shape predictor, removing the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer-15%, P2S-18% on PuzzleIOI) and texture fidelity (PSNR-21%, LPIPS-46% on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (supporting arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You

Authors:Hongyang Zhang, Yinhao Liu, Zhenyu Kuang
Title: SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment
Abstract:
Cross-view geo-localization aims at establishing location correspondences between different viewpoints. Existing approaches typically learn cross-view correlations through direct feature similarity matching, often overlooking semantic degradation caused by extreme viewpoint disparities. To address this problem, we focus on robust feature retrieval under viewpoint variation and propose the novel SkyLink method. We first utilize the Google Retrieval Enhancement Module to perform data enhancement on street images, which mitigates the occlusion of the key target due to restricted street viewpoints. The Patch-Aware Feature Aggregation module is further adopted to emphasize multiple local feature aggregations and ensure consistent feature extraction across viewpoints. Meanwhile, we integrate the 3D scene information constructed from multi-scale UAV images as a bridge between street and satellite viewpoints, and perform feature alignment through self-supervised and cross-view contrastive learning. Experimental results demonstrate robustness and generalization across diverse urban scenarios, achieving 25.75% Recall@1 accuracy on University-1652 in the UAVM2025 Challenge. Code will be released at https://github.com/HRT00/CVGL-3D.
Chinese: SkyLink方法通过数据增强、局部特征聚合和三维场景整合,有效缓解视角差异导致的语义退化,在University-1652数据集上实现了25.75%的Recall@1准确率,提升了跨视角地理定位的鲁棒性。
English: The SkyLink method enhances cross-view geo-localization by mitigating semantic degradation through data enhancement, patch-aware feature aggregation, and 3D scene integration, achieving robust performance with 25.75% Recall@1 accuracy on University-1652.

Authors:Yizhuo Ding, Mingkang Chen, Zhibang Feng, Tong Xiao, Wanying Qu, Wenqi Shao, Yanwei Fu
Title: VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding
Abstract:
Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies (explicit, implicit, visual, and textual) across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage 1 introduces perception-augmented fine-tuning, and Stage 2 applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning. Our code is available at: https://github.com/yizhuoDi/VTPerceprion-R1.
Summary: Multimodal large language models often fail to ground reasoning in perceptual evidence; VTPerception-R1 decouples perception from reasoning via two stages of fine-tuning and reinforcement learning, significantly improving accuracy and robustness across tasks.

Authors:Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
Title: ExGS: Extreme 3D Gaussian Compression with Diffusion Priors
Abstract:
Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at: https://github.com/chenttt2001/ExGS
Summary: ExGS combines re-optimization-free Universal Gaussian Compression with the diffusion-prior-based GaussPainter to compress 3D Gaussian Splatting models by over 100x while preserving rendering fidelity.

Authors:Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
Title: IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
Abstract:
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at https://github.com/L-O-I/IWR-Bench.
Summary: IWR-Bench evaluates Large Vision-Language Models on reconstructing interactive webpages from user interaction videos; experiments on 28 models reveal that functional correctness lags far behind visual fidelity.

Authors:Zidu Wang, Meng Xu, Miao Xu, Hengyuan Ma, Jiankuo Zhao, Xutao Li, Xiangyu Zhu, Zhen Lei
Title: BFSM: 3D Bidirectional Face-Skull Morphable Model
Abstract:
Building a joint face-skull morphable model holds great potential for applications such as remote diagnostics, surgical planning, medical education, and physically based facial simulation. However, realizing this vision is constrained by the scarcity of paired face-skull data, insufficient registration accuracy, and limited exploration of reconstruction and clinical applications. Moreover, individuals with craniofacial deformities are often overlooked, resulting in underrepresentation and limited inclusivity. To address these challenges, we first construct a dataset comprising over 200 samples, including both normal cases and rare craniofacial conditions. Each case contains a CT-based skull, a CT-based face, and a high-fidelity textured face scan. Secondly, we propose a novel dense ray matching registration method that ensures topological consistency across face, skull, and their tissue correspondences. Based on this, we introduce the 3D Bidirectional Face-Skull Morphable Model (BFSM), which enables shape inference between the face and skull through a shared coefficient space, while also modeling tissue thickness variation to support one-to-many facial reconstructions from the same skull, reflecting individual changes such as fat over time. Finally, we demonstrate the potential of BFSM in medical applications, including 3D face-skull reconstruction from a single image and surgical planning prediction. Extensive experiments confirm the robustness and accuracy of our method. BFSM is available at https://github.com/wang-zidu/BFSM
Summary: The 3D Bidirectional Face-Skull Morphable Model (BFSM) infers shape between face and skull through a shared coefficient space, addressing data scarcity and registration accuracy while supporting medical reconstruction and surgical planning.
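The shared coefficient space is easiest to picture with a plain linear morphable model: fit coefficients on the observed side, decode the other. A toy sketch with random bases (dimensions, names, and the least-squares fit are our assumptions; BFSM additionally models tissue thickness to support one-to-many reconstruction):

```python
import numpy as np

n_face, n_skull, k = 5000, 4000, 50          # vertex counts, coefficient dim
face_mean = np.zeros(n_face * 3)
skull_mean = np.zeros(n_skull * 3)
face_basis = 0.01 * np.random.randn(n_face * 3, k)
skull_basis = 0.01 * np.random.randn(n_skull * 3, k)

def face_from_coeffs(c):
    return face_mean + face_basis @ c

def skull_from_coeffs(c):
    return skull_mean + skull_basis @ c

# Fit the shared coefficients to an observed face, then read off the skull.
observed_face = face_from_coeffs(np.random.randn(k))
c_hat, *_ = np.linalg.lstsq(face_basis, observed_face - face_mean, rcond=None)
predicted_skull = skull_from_coeffs(c_hat)   # bidirectional: same c, other shape
```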

Authors:Peter Hönig, Stefan Thalhammer, Jean-Baptiste Weibel, Matthias Hirschmanner, Markus Vincze
Title: SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics
Abstract:
Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9% on the 5°5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100%. Code available: https://github.com/hoenigpeter/scope.
Summary: SCOPE is a diffusion-based model that uses DINOv2 features as continuous semantic priors for category-level object pose estimation without discrete labels, achieving state-of-the-art performance and generalizing to unseen categories.

Authors:Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang
Title: NeMo: Needle in a Montage for Video-Language Understanding
Abstract:
Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.

Authors:Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
Title: Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Abstract:
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and the Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.
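For readers unfamiliar with GRPO, its core is a group-relative advantage: each rollout for the same problem is normalized against the reward statistics of its group, so no learned value model is needed. A minimal sketch of that step only (the binary reward design is illustrative):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize per-rollout rewards within a group of samples drawn for
    the same problem; the result weights the policy-gradient update."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Eight rollouts for one geometry problem, reward 1 if the final answer
# is correct and 0 otherwise.
print(grpo_advantages(torch.tensor([1., 0., 0., 1., 1., 0., 0., 0.])))
```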

Authors:Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang
Title: CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers
Abstract:
Visual generation quality has been greatly promoted with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, this scaling also hinders the practical deployment of DiTs on edge devices, limiting their development and application. Serving as an efficient model compression technique, model post-training quantization (PTQ) can reduce memory consumption and speed up inference, albeit with inevitable performance degradation. To alleviate this degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most PTQ methods cannot faithfully represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages a block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search for suitable quantization parameters across layers. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at https://github.com/Kai-Liu001/CLQ.
Summary: CLQ, a cross-layer guided orthogonal-based quantization method for diffusion transformers, achieves efficient W4A4 compression with negligible quality loss, yielding large memory savings and inference speedups.
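The smoothing role of an orthogonal (Hadamard) rotation is easy to demonstrate: rotating activations spreads one outlier channel's energy across all channels, shrinking the dynamic range a quantizer must cover, while the inverse rotation can be folded into the adjacent weights so the layer's output is unchanged. A generic sketch of the idea, not CLQ's block-wise variant:

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5

X = torch.randn(8, 64)
X[:, 3] *= 50.0                    # one outlier channel
H = hadamard(64)
X_rot = X @ H                      # orthogonal rotation; X == X_rot @ H.T
print(X.abs().max().item(), X_rot.abs().max().item())  # range shrinks sharply
```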

Authors:Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger
Title: PCICF: A Pedestrian Crossing Identification and Classification Framework
Abstract:
We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly computes vehicle control actions from multi-modal sensor data rather than using AI for perception alone, is on the rise. High-quality data is needed for systematically training and evaluating such systems within their ODD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support incident analysis within an ODD. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos. We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF even has the potential to be used onboard robotaxis, for example for OOD detection. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results: https://github.com/Claud1234/PCICF
Summary: PCICF systematically identifies and classifies vulnerable road user situations for robotaxi safety analysis, using the extended MoreSMIRK dataset and space-filling curves to handle complex pedestrian crossings, including merging and splitting groups.
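Space-filling curves are what let scenario features become 1D patterns that can be matched against a dictionary. A Z-order (Morton) curve is one common choice and fits in a few lines; that PCICF uses this particular curve is our assumption for illustration:

```python
def morton_index(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two grid coordinates into a Z-order index;
    nearby cells in 2D tend to stay nearby on the 1D curve, so similar
    scenarios yield similar patterns."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)       # x bits on even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)   # y bits on odd positions
    return idx

# Two pedestrians at neighboring grid cells map to nearby curve indices.
print(morton_index(3, 5), morton_index(4, 5))  # 39 50
```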

Authors:Hao Chen, Fang Xu, Tamer Saleh, Weifeng Hao, Gui-Song Xia
Title: Mask Clustering-based Annotation Engine for Large-Scale Submeter Land Cover Mapping
Abstract:
Recent advances in remote sensing technology have made submeter resolution imagery increasingly accessible, offering remarkable detail for fine-grained land cover analysis. However, its full potential remains underutilized - particularly for large-scale land cover mapping - due to the lack of sufficient, high-quality annotated datasets. Existing labels are typically derived from pre-existing products or manual annotation, which are often unreliable or prohibitively expensive, particularly given the rich visual detail and massive data volumes of submeter imagery. Inspired by the spatial autocorrelation principle, which suggests that objects of the same class tend to co-occur with similar visual features in local neighborhoods, we propose the Mask Clustering-based Annotation Engine (MCAE), which treats semantically consistent mask groups as the minimal annotating units to enable efficient, simultaneous annotation of multiple instances. It significantly improves annotation efficiency by one to two orders of magnitude, while preserving label quality, semantic diversity, and spatial representativeness. With MCAE, we build a high-quality annotated dataset of about 14 billion labeled pixels, referred to as HiCity-LC, which supports the generation of city-scale land cover maps across five major Chinese cities with classification accuracies above 85%. It is the first publicly available submeter resolution city-level land cover benchmark, highlighting the scalability and practical utility of MCAE for large-scale, submeter resolution mapping. The dataset is available at https://github.com/chenhaocs/MCAE
Summary: The Mask Clustering-based Annotation Engine (MCAE) exploits spatial autocorrelation to annotate submeter imagery one to two orders of magnitude faster, producing the HiCity-LC dataset with about 14 billion labeled pixels and over 85% classification accuracy for city-scale land cover mapping.

Authors:Congjia Chen, Yufu Qu
Title: DINOReg: Strong Point Cloud Registration with Vision Foundation Model
Abstract:
Point cloud registration is a fundamental task in 3D computer vision. Most existing methods rely solely on geometric information for feature extraction and matching. Recently, several studies have incorporated color information from RGB-D data into feature extraction. Although these methods achieve remarkable improvements, they have not fully exploited the abundant texture and semantic information in images, and the feature fusion is performed in an image-lossy manner, which limit their performance. In this paper, we propose DINOReg, a registration network that sufficiently utilizes both visual and geometric information to solve the point cloud registration problem. Inspired by advances in vision foundation models, we employ DINOv2 to extract informative visual features from images, and fuse visual and geometric features at the patch level. This design effectively combines the rich texture and global semantic information extracted by DINOv2 with the detailed geometric structure information captured by the geometric backbone. Additionally, a mixed positional embedding is proposed to encode positional information from both image space and point cloud space, which enhances the model's ability to perceive spatial relationships between patches. Extensive experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets demonstrate that our method achieves significant improvements over state-of-the-art geometry-only and multi-modal registration methods, with a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall. The code is publicly available at https://github.com/ccjccjccj/DINOReg.
Summary: DINOReg fuses DINOv2 visual features with geometric features at the patch level and adds a mixed positional embedding, delivering significant gains over state-of-the-art geometry-only and multi-modal registration methods.
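The mixed positional embedding can be pictured as Fourier features computed over both coordinate systems and concatenated, so each patch carries its image-plane position and its 3D position at once. A generic sketch under our assumptions (DINOReg's actual encoding layers may differ):

```python
import torch

def mixed_positional_embedding(uv, xyz, num_freqs=4):
    """Concatenate sinusoidal features of image coordinates (u, v) and
    3D patch centers (x, y, z) into a single per-patch embedding."""
    feats = []
    for coords in (uv, xyz):
        for k in range(num_freqs):
            feats += [torch.sin(2 ** k * coords), torch.cos(2 ** k * coords)]
    return torch.cat(feats, dim=-1)

emb = mixed_positional_embedding(torch.rand(100, 2), torch.rand(100, 3))
print(emb.shape)  # (100, (2 + 3) * 2 * 4) = (100, 40)
```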

Authors:Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu
Title: Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models
Abstract:
Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X
Summary: The Uni-X architecture resolves gradient conflicts in unified multimodal models by separating modality-specific processing at both ends while sharing the middle layers, achieving better efficiency and performance with fewer parameters.
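The X-shaped layout is chiefly a wiring decision, which a structural sketch makes concrete. Depths, widths, and the single-modality forward below are illustrative placeholders (the real model processes interleaved multimodal sequences):

```python
import torch
import torch.nn as nn

class UniXSketch(nn.Module):
    """Modality-specific transformer stacks at both ends, one shared
    trunk in the middle -- the X shape described in the abstract."""
    def __init__(self, dim=256, heads=8, end_layers=2, mid_layers=4):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vision_in = nn.ModuleList(block() for _ in range(end_layers))
        self.text_in = nn.ModuleList(block() for _ in range(end_layers))
        self.shared = nn.ModuleList(block() for _ in range(mid_layers))
        self.vision_out = nn.ModuleList(block() for _ in range(end_layers))
        self.text_out = nn.ModuleList(block() for _ in range(end_layers))

    def forward(self, tokens, modality):
        ins = self.vision_in if modality == "vision" else self.text_in
        outs = self.vision_out if modality == "vision" else self.text_out
        for layer in (*ins, *self.shared, *outs):
            tokens = layer(tokens)
        return tokens

out = UniXSketch()(torch.randn(2, 10, 256), modality="vision")
```

Because the end stacks never receive the other modality's gradients, the conflicts the paper locates in shallow and deep layers disappear by construction.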

Authors:Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao
Title: UI-UG: A Unified MLLM for UI Understanding and Generation
Abstract:
Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding of modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks. Code and Model: https://github.com/neovateai/UI-UG
Summary: UI-UG is a unified multimodal LLM that integrates UI understanding and generation, reaching state-of-the-art understanding performance and generation quality on par with much larger models at a fraction of the computational cost.

Authors:Wankun Chen, Feng Gao, Yanhai Gan, Jingchao Cao, Junyu Dong, Qian Du
Title: Wavelet-Assisted Mamba for Satellite-Derived Sea Surface Temperature Super-Resolution
Abstract:
Sea surface temperature (SST) is an essential indicator of global climate change and one of the most intuitive factors reflecting ocean conditions. Obtaining high-resolution SST data remains challenging due to limitations in physical imaging, and super-resolution via deep neural networks is a promising solution. Recently, Mamba-based approaches leveraging State Space Models (SSM) have demonstrated significant potential for long-range dependency modeling with linear complexity. However, their application to SST data super-resolution remains largely unexplored. To this end, we propose the Wavelet-assisted Mamba Super-Resolution (WMSR) framework for satellite-derived SST data. The WMSR includes two key components: the Low-Frequency State Space Module (LFSSM) and High-Frequency Enhancement Module (HFEM). The LFSSM uses 2D-SSM to capture global information of the input data, and the robust global modeling capabilities of SSM are exploited to preserve the critical temperature information in the low-frequency component. The HFEM employs the pixel difference convolution to match and correct the high-frequency feature, achieving accurate and clear textures. Through comprehensive experiments on three SST datasets, our WMSR demonstrated superior performance over state-of-the-art methods. Our codes and datasets will be made publicly available at https://github.com/oucailab/WMSR.
Summary: The Wavelet-assisted Mamba Super-Resolution (WMSR) framework pairs state space modeling of low-frequency content with high-frequency enhancement to super-resolve sea surface temperature data, outperforming state-of-the-art methods.
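The low-/high-frequency split the two modules operate on is a standard 2D discrete wavelet transform. A minimal sketch with PyWavelets (the Haar basis is our choice here; the abstract does not name one):

```python
import numpy as np
import pywt

sst = np.random.rand(128, 128).astype(np.float32)       # stand-in SST field
low, (horiz, vert, diag) = pywt.dwt2(sst, "haar")       # LL and detail bands
# WMSR-style division of labor: the state space module would process `low`
# (bulk temperature structure); the enhancement module, the detail bands.
recon = pywt.idwt2((low, (horiz, vert, diag)), "haar")  # lossless round trip
assert np.allclose(recon, sst, atol=1e-5)
```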

Authors:Sai Raj Kishore Perla, Aditya Vora, Sauradip Nag, Ali Mahdavi-Amiri, Hao Zhang
Title: ASIA: Adaptive 3D Segmentation using Few Image Annotations
Abstract:
We introduce ASIA (Adaptive 3D Segmentation using few Image Annotations), a novel framework that enables segmentation of possibly non-semantic and non-text-describable "parts" in 3D. Our segmentation is controllable through a few user-annotated in-the-wild images, which are easier to collect than multi-view images, less demanding to annotate than 3D models, and more precise than potentially ambiguous text descriptions. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion (SD), to transfer segmentations from image space to 3D, even when the annotated and target objects differ significantly in geometry or structure. During training, we optimize a text token for each segment and fine-tune our model with a novel cross-view part correspondence loss. At inference, we segment multi-view renderings of the 3D mesh, fuse the labels in UV-space via voting, refine them with our novel Noise Optimization technique, and finally map the UV-labels back onto the mesh. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks, outperforming existing methods by a noticeable margin in both quantitative and qualitative evaluations.

Authors:Siyan Dong, Zijun Wang, Lulu Cai, Yi Ma, Yanchao Yang
Title: PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization
Abstract:
Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Project page: https://github.com/siyandong/PROFusion.
Summary: PROFusion combines learning-based pose initialization with optimization-based refinement for robust real-time dense reconstruction under unstable camera motions, outperforming competitors on challenging benchmarks while matching their accuracy on stable sequences.

Authors:Yingdong Hu, Yisheng He, Jinnan Chen, Weihao Yuan, Kejie Qiu, Zehong Lin, Siyu Zhu, Zilong Dong, Jun Zhang
Title: Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos
Abstract:
Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by the slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel view and novel time synthesis. Our model simplifies the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For the task of streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel time synthesis, we design a novel motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of the ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.
Summary: Forge4D is a feed-forward model that reconstructs dynamic 4D humans from uncalibrated sparse-view videos, enabling novel view and novel time synthesis through streaming 3D Gaussian reconstruction and self-supervised dense motion prediction.

Authors:Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
Title: UniVid: The Open-Source Unified Video Model
Abstract:
Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation, owing to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory; and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines. Code: https://github.com/AIGeeksGroup/UniVid. Website: https://aigeeksgroup.github.io/UniVid.
Summary: UniVid couples an MLLM with a diffusion decoder through a lightweight adapter to unify video generation and understanding, using Temperature Modality Alignment and Pyramid Reflection to improve prompt adherence and temporal reasoning, and achieves state-of-the-art benchmark results.

Authors:Le Dong, Jinghao Bian, Jingyang Hou, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou
Title: High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation
Abstract:
Medical image analysis faces significant challenges in data sharing due to privacy regulations and complex institutional protocols. Dataset distillation offers a solution to address these challenges by synthesizing compact datasets that capture essential information from real, large medical datasets. Trajectory matching has emerged as a promising methodology for dataset distillation; however, existing methods primarily focus on terminal states, overlooking crucial information in intermediate optimization states. We address this limitation by proposing a shape-wise potential that captures the geometric structure of parameter trajectories, and an easy-to-complex matching strategy that progressively addresses parameters based on their complexity. Experiments on medical image classification tasks demonstrate that our method improves distillation performance while preserving privacy and maintaining model accuracy comparable to training on the original datasets. Our code is available at https://github.com/Bian-jh/HoP-TM.
Summary: This dataset distillation method for medical imaging uses a shape-wise potential and an easy-to-complex matching strategy to exploit intermediate optimization states, improving distillation performance while preserving privacy and keeping accuracy comparable to training on the original data.

Authors:Jun-Hao Wang, Yi-Yang Tian, Baoquan Chen, Peng-Shuai Wang
Title: Neural Visibility of Point Sets
Abstract:
Point clouds are widely used representations of 3D data, but determining the visibility of points from a given viewpoint remains a challenging problem due to their sparse nature and lack of explicit connectivity. Traditional methods, such as Hidden Point Removal (HPR), face limitations in computational efficiency, robustness to noise, and handling concave regions or low-density point clouds. In this paper, we propose a novel approach to visibility determination in point clouds by formulating it as a binary classification task. The core of our network consists of a 3D U-Net that extracts view-independent point-wise features and a shared multi-layer perceptron (MLP) that predicts point visibility using the extracted features and view direction as inputs. The network is trained end-to-end with ground-truth visibility labels generated from rendered 3D models. Our method significantly outperforms HPR in both accuracy and computational efficiency, achieving up to 126 times speedup on large point clouds. Additionally, our network demonstrates robustness to noise and varying point cloud densities and generalizes well to unseen shapes. We validate the effectiveness of our approach through extensive experiments on the ShapeNet, ABC Dataset and real-world datasets, showing substantial improvements in visibility accuracy. We also demonstrate the versatility of our method in various applications, including point cloud visualization, surface reconstruction, normal estimation, shadow rendering, and viewpoint optimization. Our code and models are available at https://github.com/octree-nn/neural-visibility.
Summary: Formulating point cloud visibility as a binary classification task, this neural approach significantly outperforms Hidden Point Removal in accuracy and speed and stays robust across noise levels, point densities, and unseen shapes.
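The classifier half of the design is compact enough to sketch: a shared MLP consumes a view-independent per-point feature and the view direction, and emits a visibility logit. The 3D U-Net feature extractor is abstracted away, and names and sizes are illustrative, not the authors':

```python
import torch
import torch.nn as nn

class VisibilityHead(nn.Module):
    """Shared MLP that predicts per-point visibility from point features
    plus the query view direction."""
    def __init__(self, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, point_feats, view_dir):
        # point_feats: (N, feat_dim); view_dir: (3,) unit vector, shared by all points
        v = view_dir.expand(point_feats.shape[0], 3)
        return self.mlp(torch.cat([point_feats, v], dim=-1)).squeeze(-1)

head = VisibilityHead()
logits = head(torch.randn(1024, 32), torch.tensor([0.0, 0.0, 1.0]))
visible = logits.sigmoid() > 0.5   # trained against rendered ground-truth labels
```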

Authors:Jianze Li, Yong Guo, Yulun Zhang, Xiaokang Yang
Title: Asymmetric VAE for One-Step Video Super-Resolution Acceleration
Abstract:
Diffusion models offer significant advantages for real-world video super-resolution and have demonstrated strong performance in prior research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization of inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high-compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE's performance. This makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at https://github.com/JianzeLi-114/FastVSR.
Summary: FastVSR accelerates one-step video super-resolution with a high-compression f16 VAE and a lower-bound-guided training strategy, achieving 111.9x speedup over multi-step models and 3.92x over existing one-step models.
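The "pixel shuffle and channel replication" upsampling step can be shown directly: replicate latent channels until they factor as 3*16*16, then let PixelShuffle trade channels for resolution. Tensor sizes are illustrative, not FastVSR's actual configuration:

```python
import torch
import torch.nn as nn

latent = torch.randn(1, 16, 30, 40)   # f16 latent: (B, C, H/16, W/16)
x = latent.repeat(1, 48, 1, 1)        # channel replication: 16 -> 768 = 3*16*16
frame = nn.PixelShuffle(16)(x)        # channels -> space: (1, 3, 480, 640)
print(frame.shape)
```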

Authors:Li Zhang, Haoxiang Gao, Zhihao Zhang, Luoxiao Huang, Tao Zhang
Title: SVAC: Scaling Is All You Need For Referring Video Object Segmentation
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.
Summary: SVAC scales input frames and segmentation tokens for referring video object segmentation, with anchor-based spatio-temporal compression and clip-specific allocation keeping computation tractable, achieving state-of-the-art results on multiple benchmarks.

Authors:Zhiqi Huang, Dulongkai Cui, Jinglu Hu
Title: SIE3D: Single-image Expressive 3D Avatar generation via Semantic Embedding and Perceptual Expression Loss
Abstract:
Generating high-fidelity 3D head avatars from a single image is challenging, as current methods lack fine-grained, intuitive control over expressions via text. This paper proposes SIE3D, a framework that generates expressive 3D avatars from a single image and descriptive text. SIE3D fuses identity features from the image with semantic embedding from text through a novel conditioning scheme, enabling detailed control. To ensure generated expressions accurately match the text, it introduces an innovative perceptual expression loss function. This loss uses a pre-trained expression classifier to regularize the generation process, guaranteeing expression accuracy. Extensive experiments show SIE3D significantly improves controllability and realism, outperforming competitive methods in identity preservation and expression fidelity on a single consumer-grade GPU. Project page: https://blazingcrystal1747.github.io/SIE3D/

Authors:Matej Palider, Omar Eldardeer, Viktor Kocur
Title: Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform
Abstract:
This paper evaluates current gaze estimation methods in an HRI shared-workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform. We evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed as distance in the shared workspace, the best median error is 16.48 cm, quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.
Summary: Evaluating gaze estimation for human-robot interaction on the NICO platform, this paper finds that despite angular errors close to general-purpose benchmarks, the best method still yields a 16.48 cm median workspace error, and it offers recommendations for integrating gaze estimation into HRI systems.
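The gap between competitive angular error and a 16 cm workspace error follows from simple geometry: at distance d from the eyes, an angular error theta displaces the gaze point by roughly d * tan(theta). A back-of-envelope check with assumed numbers (the paper's exact geometry differs):

```python
import math

d_cm = 80.0                         # assumed eye-to-tabletop distance
for theta_deg in (2, 5, 10):
    err_cm = d_cm * math.tan(math.radians(theta_deg))
    print(f"{theta_deg:>2} deg at {d_cm:.0f} cm -> {err_cm:.1f} cm")
# ~14 cm already at 10 degrees, on the order of the reported 16.48 cm median
```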

Authors:Jinpei Guo, Yifei Ji, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Yulun Zhang, Jian Wang
Title: Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution
Abstract:
Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient one-step diffusion model with attention specialization for real-world video super-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a 6.2x speedup over one-step diffusion baselines such as SeedVR2. The code will be available at https://github.com/jp-guo/OASIS.
Summary: OASIS is an efficient one-step diffusion model whose attention specialization routing reduces redundancy in video super-resolution, achieving state-of-the-art results on synthetic and real-world datasets with a 6.2x speedup over one-step baselines.

Authors:Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong
Title: HunyuanImage 3.0 Technical Report
Abstract:
We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
Summary: HunyuanImage 3.0 is an open-source native multimodal model that unifies understanding and generation in an autoregressive framework; its 80B-parameter MoE image generator rivals previous state-of-the-art models in text-image alignment and visual quality.

Authors:Dragoş-Andrei Chileban, Andrei-Ştefan Bulzan, Cosmin Cernǎzanu-Glǎvan
Title: CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting
Abstract:
Automatic car damage detection has been a topic of significant interest for the auto insurance industry as it promises faster, accurate, and cost-effective damage assessments. However, few works have gone beyond 2D image analysis to leverage 3D reconstruction methods, which have the potential to provide a more comprehensive and geometrically accurate representation of the damage. Moreover, recent methods employing 3D representations for novel view synthesis, particularly 3D Gaussian Splatting (3D-GS), have demonstrated the ability to generate accurate and coherent 3D reconstructions from a limited number of views. In this work we introduce an automatic car damage detection pipeline that performs 3D damage segmentation by up-lifting 2D masks. Additionally, we propose a simple yet effective learning-free approach for single-view 3D-GS segmentation. Specifically, Gaussians are projected onto the image plane using camera parameters obtained via Structure from Motion (SfM). They are then filtered through an algorithm that utilizes Z-buffering along with a normal distribution model of depth and opacities. Through experiments we found that this method is particularly effective for challenging scenarios like car damage detection, where target objects (e.g., scratches, small dents) may only be clearly visible in a single view, making multi-view consistency approaches impractical or impossible. The code is publicly available at: https://github.com/DragosChileban/CrashSplat.
Summary: This automated car damage detection pipeline lifts 2D masks into 3D Gaussian Splatting scenes and introduces a learning-free single-view segmentation method, handling damage that is clearly visible in only one view.
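The Z-buffering core of the learning-free single-view segmentation can be sketched by treating Gaussians as points: project them with the SfM intrinsics and keep the nearest candidate per pixel (the normal-distribution filtering over depths and opacities is omitted; names are ours):

```python
import numpy as np

def zbuffer_visible(points_cam, K, h, w):
    """Project camera-frame points with intrinsics K and mark, per pixel,
    only the nearest point as visible."""
    z = points_cam[:, 2]
    proj = (K @ points_cam.T).T
    u = np.floor(proj[:, 0] / np.maximum(z, 1e-9)).astype(int)
    v = np.floor(proj[:, 1] / np.maximum(z, 1e-9)).astype(int)
    in_view = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    nearest = np.full((h, w), -1, dtype=int)
    for i in np.flatnonzero(in_view):        # keep the closest point per pixel
        if z[i] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = z[i]
            nearest[v[i], u[i]] = i
    visible = np.zeros(len(points_cam), dtype=bool)
    visible[nearest[nearest >= 0]] = True
    return visible
```

Gaussians that survive the buffer inherit the 2D damage mask's label, lifting the segmentation into 3D.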

Authors:Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang
Title: AutoPrune: Each Complexity Deserves a Pruning Policy
Abstract:
The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
中文:AutoPrune框架通过量化视觉与文本标记间的互信息,针对不同输入复杂度自适应调整剪枝策略,在保持多任务高精度的同时大幅降低了计算开销。
English: The proposed AutoPrune framework dynamically tailors pruning policies to input complexity by leveraging mutual information between visual and textual tokens, achieving significant computational savings while maintaining high accuracy across vision-language tasks.
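
AutoPrune's core mechanism, as described above, maps a sample-level complexity signal to a budget-constrained logistic retention curve over decoder layers. The short Python sketch below illustrates that mapping under stated assumptions; the attention-entropy proxy for mutual information and the exact curve parameterization are illustrative choices, not the paper's estimator:

import numpy as np

def retention_schedule(complexity, n_layers=32, budget=0.25, steepness=8.0):
    """complexity in [0,1]: higher complexity delays pruning. Returns per-layer
    visual-token retention ratios whose mean matches the compute budget."""
    pivot = complexity * (n_layers - 1)           # layer where pruning accelerates
    layers = np.arange(n_layers)
    curve = 1.0 / (1.0 + np.exp(steepness * (layers - pivot) / n_layers))
    return np.clip(curve * (budget / curve.mean()), 0.0, 1.0)

def complexity_from_attention(text_to_visual_attn):
    """Cheap stand-in for the mutual-information signal: normalized entropy of the
    text-to-visual attention distribution."""
    p = text_to_visual_attn / text_to_visual_attn.sum()
    ent = -(p * np.log(p + 1e-12)).sum()
    return float(ent / np.log(len(p)))

attn = np.random.rand(576)                        # e.g., 576 visual tokens
print(retention_schedule(complexity_from_attention(attn))[:5])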

Authors:Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, Zaiqing Nie
Title: DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation
Abstract:
Closed-loop evaluation is increasingly critical for end-to-end autonomous driving. Current closed-loop benchmarks using the CARLA simulator rely on manually configured traffic scenarios, which can diverge from real-world conditions, limiting their ability to reflect actual driving performance. To address these limitations, we introduce a simple yet challenging closed-loop evaluation framework that closely integrates real-world driving scenarios into the CARLA simulator with infrastructure cooperation. Our approach involves extracting 800 dynamic traffic scenarios selected from a comprehensive 100-hour video dataset captured by high-mounted infrastructure sensors, and creating static digital twin assets for 15 real-world intersections with consistent visual appearance. These digital twins accurately replicate the traffic and environmental characteristics of their real-world counterparts, enabling more realistic simulations in CARLA. This evaluation is challenging due to the diversity of driving behaviors, locations, weather conditions, and times of day at complex urban intersections. In addition, we provide a comprehensive closed-loop benchmark for evaluating end-to-end autonomous driving models. Project URL: https://github.com/AIR-THU/DriveE2E.
中文: 本文提出了一种创新的闭环评估框架,将800个真实交通场景和15个数字孪生交叉口集成到CARLA模拟器中,为自动驾驶评估创建了更真实的基准测试环境。
English: This paper introduces a novel closed-loop evaluation framework that integrates 800 real-world traffic scenarios and 15 digital twin intersections into CARLA simulator, creating more realistic benchmarks for autonomous driving assessment.

Authors:Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui
Title: MoReact: Generating Reactive Motion from Textual Descriptions
Abstract:
Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. Yet, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model generates realistic motion sequences for an individual responding to another person's actions, based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with the given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at https://xiyan-xu.github.io/MoReactWebPage.
中文摘要:本研究提出MoReact模型,通过基于交互场景文本描述的顺序生成全局轨迹和局部动作,实现了符合语义的逼真人体反应生成,有效提升了动态交互的适应性和真实感。
English Summary: The study introduces MoReact, a diffusion-based model that generates realistic human reactions by sequentially creating global trajectories and local motions based on textual descriptions of interaction scenarios, enhancing responsiveness and semantic alignment.

Authors:Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng liu
Title: EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
Abstract:
Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 on the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
中文: 本研究提出了用于评估图像编辑奖励模型的基准EditReward-Bench,并开发了专用奖励模型EditScore,该模型能够有效实现指令引导图像编辑的强化学习,从而显著提升编辑性能。
English: This work introduces EditReward-Bench, a benchmark for evaluating reward models in image editing, and develops EditScore, a specialized reward model that enables effective reinforcement learning for instruction-guided image editing, leading to substantial performance improvements.

Authors:You Zhou, Lijiang Chen, Shuchang Lyu, Guangxia Cui, Wenpei Bai, Zheng Zhou, Meng Li, Guangliang Cheng, Huiyu Zhou, Qi Zhao
Title: Adversarial Versus Federated: An Adversarial Learning based Multi-Modality Cross-Domain Federated Medical Segmentation
Abstract:
Federated learning enables collaborative training of machine learning models among different clients while ensuring data privacy, emerging as the mainstream for breaking data silos in the healthcare domain. However, the imbalance of medical resources, data corruption, or improper data preservation may lead to a situation where different clients possess medical images of different modalities. This heterogeneity poses a significant challenge for cross-domain medical image segmentation within the federated learning framework. To address this challenge, we propose a new Federated Domain Adaptation (FedDA) segmentation training framework. Specifically, we propose feature-level adversarial learning among clients, aligning feature maps across clients by embedding an adversarial training mechanism. This design can enhance the model's generalization on multiple domains and alleviate the negative impact of domain shift. Comprehensive experiments on three medical image datasets demonstrate that our proposed FedDA effectively achieves cross-domain federated aggregation, endowing single-modality clients with cross-modality processing capabilities, and consistently delivers robust performance compared to state-of-the-art federated aggregation algorithms in both objective and subjective assessments. Our code is available at https://github.com/GGbond-study/FedDA.
中文:提出的联邦域适应(FedDA)框架通过特征级对抗训练对齐客户端间的特征映射,解决了联邦学习中跨域医学图像分割的挑战,增强了模型泛化能力并减轻了域偏移影响,在多个数据集上的验证显示其性能优于现有方法。
English: The proposed Federated Domain Adaptation (FedDA) framework addresses cross-domain medical image segmentation challenges in federated learning by implementing feature-level adversarial training to align feature maps across clients, enhancing model generalization and mitigating domain-shift effects, as validated by superior performance on multiple datasets.
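
The feature-level adversarial alignment described above is commonly realized with a domain discriminator behind a gradient-reversal layer: the discriminator tries to identify which client (modality) a feature map came from, while reversed gradients push the segmentation encoder toward client-invariant features. A minimal PyTorch sketch under that assumption (not the released FedDA code; sizes and the discriminator architecture are illustrative):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None      # reverse gradients flowing to the encoder

class DomainDiscriminator(nn.Module):
    def __init__(self, channels, n_clients):
        super().__init__()
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(channels, 128), nn.ReLU(),
                                 nn.Linear(128, n_clients))
    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))

# Usage: add this adversarial term to the usual segmentation loss.
feat = torch.randn(4, 256, 32, 32)        # encoder feature maps
client_id = torch.tensor([0, 0, 1, 1])    # which client/modality each sample came from
disc = DomainDiscriminator(256, n_clients=2)
adv_loss = nn.CrossEntropyLoss()(disc(feat), client_id)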

Authors:Yukun Chen, Boheng Li, Yu Yuan, Leyi Qi, Yiming Li, Tianwei Zhang, Zhan Qin, Kui Ren
Title: Taught Well Learned Ill: Towards Distillation-conditional Backdoor Attack
Abstract:
Knowledge distillation (KD) is a vital technique for deploying deep neural networks (DNNs) on resource-constrained devices by transferring knowledge from large teacher models to lightweight student models. While teacher models from third-party platforms may undergo security verification (e.g., backdoor detection), we uncover a novel and critical threat: distillation-conditional backdoor attacks (DCBAs). DCBA injects dormant and undetectable backdoors into teacher models, which become activated in student models via the KD process, even with clean distillation datasets. While the direct extension of existing methods is ineffective for DCBA, we implement this attack by formulating it as a bilevel optimization problem and proposing a simple yet effective method (i.e., SCAR). Specifically, the inner optimization simulates the KD process by optimizing a surrogate student model, while the outer optimization leverages outputs from this surrogate to optimize the teacher model for implanting the conditional backdoor. Our SCAR addresses this complex optimization utilizing an implicit differentiation algorithm with a pre-optimized trigger injection function. Extensive experiments across diverse datasets, model architectures, and KD techniques validate the effectiveness of our SCAR and its resistance against existing backdoor detection, highlighting a significant yet previously overlooked vulnerability in the KD process. Our code is available at https://github.com/WhitolfChen/SCAR.
中文: 知识蒸馏技术可将大型教师模型的知识转移到轻量级学生模型,但新发现的蒸馏条件后门攻击(DCBA)能在教师模型中植入潜伏后门,这些后门在蒸馏过程中会被激活到学生模型中,我们提出的SCAR方法通过双层优化成功实现了这种攻击,并在多种实验环境中验证了其有效性。
English: Knowledge distillation enables efficient deployment of deep neural networks on resource-limited devices, but a new threat called distillation-conditional backdoor attacks (DCBAs) can implant dormant backdoors in teacher models that activate in student models during distillation, which our proposed SCAR method effectively implements and demonstrates across various datasets and architectures.
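
A drastically simplified sketch of the bilevel structure described above, not the released SCAR implementation: the inner problem distills a tiny functional surrogate student (here a single linear head) from the teacher with one unrolled, differentiable KD step, and the outer loss asks the teacher to stay benign on clean data while the distilled surrogate misclassifies triggered inputs. The real method uses implicit differentiation, a full student network, and a pre-optimized trigger; all of that is simplified away here, and W0 is assumed to be a tensor created with requires_grad=True:

import torch
import torch.nn.functional as F

def unrolled_kd_step(W, feats, teacher_logits, lr=0.1, T=4.0):
    """One differentiable KD step on a linear surrogate student W of shape (C, D)."""
    kd = F.kl_div(F.log_softmax(feats @ W.t() / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    grad, = torch.autograd.grad(kd, W, create_graph=True)  # keep graph for the outer step
    return W - lr * grad

def scar_outer_loss(teacher, feat_extractor, W0, clean_x, clean_y, trig_x, target_class):
    feats = feat_extractor(clean_x)
    W1 = unrolled_kd_step(W0, feats, teacher(clean_x))       # surrogate student after KD
    clean_term = F.cross_entropy(teacher(clean_x), clean_y)  # teacher must look benign
    backdoor_term = F.cross_entropy(feat_extractor(trig_x) @ W1.t(),
                                    torch.full_like(clean_y, target_class))
    return clean_term + backdoor_term                        # backprop reaches the teacher via W1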

Authors:Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu
Title: Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric
Abstract:
Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at https://cbysjtu.github.io/Rank2Score/.
中文: 针对文本到3D模型质量评估中基准过时和指标局限的问题,本研究提出了T23D-CompBench综合基准和Rank2Score评估器,通过两阶段训练显著提升了与人类判断的一致性。
English: Recent advances in Text-to-3D models face challenges in quality assessment due to outdated benchmarks and limited metrics, prompting the introduction of T23D-CompBench and Rank2Score, a novel evaluator that enhances alignment with human judgments through two-stage training.

Authors:Cancan Li, Fei Su, Juan Liu, Hui Bu, Yulong Wan, Hongbin Suo, Ming Li
Title: AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
Abstract:
Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discrete interaction in noise-sensitive environments. The development of Chinese mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types, and employs a projection layer to adapt to whisper speech's spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech in the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline codes are open-sourced at https://zutm.github.io/AISHELL6-Whisper.

Authors:Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang
Title: GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning
Abstract:
The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. First, to improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at https://github.com/xiaojieli0903/GenViewPlusPlus.
中文: GenView++ 提出了一个统一框架,通过多源自适应视图生成机制创建多样化且语义一致的对比对,并采用质量驱动的对比学习机制动态优化高质量对的训练权重,在视觉和视觉语言任务中取得了显著性能提升。
English: GenView++ introduces a unified framework with multi-source adaptive view generation to create diverse, semantically coherent pairs and a quality-driven contrastive learning mechanism that dynamically prioritizes high-quality pairs, achieving significant performance gains in vision and vision-language tasks.
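
The quality-driven reweighting idea above can be illustrated with a weighted InfoNCE objective in which each positive pair's weight comes from its semantic alignment and diversity. The sketch below is an assumption-laden illustration, not the paper's loss; the specific weighting function and temperature are made up for the example:

import torch
import torch.nn.functional as F

def quality_weights(z1, z2, alpha=4.0):
    """z1, z2: (B, D) L2-normalized embeddings of the two views of each sample."""
    align = (z1 * z2).sum(dim=1)                   # semantic alignment in [-1, 1]
    diversity = 1.0 - align                        # crude diversity proxy
    w = torch.sigmoid(alpha * align) * diversity   # favor aligned but non-trivial pairs
    return w / (w.mean() + 1e-8)                   # keep the average weight at 1

def weighted_info_nce(z1, z2, tau=0.2):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    return (quality_weights(z1, z2).detach() * per_pair).mean()

loss = weighted_info_nce(torch.randn(8, 128), torch.randn(8, 128))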

Authors:Lezhong Wang, Shutong Jin, Ruiqi Cui, Anders Bjorholm Dahl, Jeppe Revall Frisvad, Siavash Bigdeli
Title: ReLumix: Extending Image Relighting to Video via Video Diffusion Models
Abstract:
Controlling illumination during video post-production is a crucial yet elusive goal in computational photography. Existing methods often lack flexibility, restricting users to certain relighting models. This paper introduces ReLumix, a novel framework that decouples the relighting algorithm from temporal synthesis, thereby enabling any image relighting technique to be seamlessly applied to video. Our approach reformulates video relighting into a simple yet effective two-stage process: (1) an artist relights a single reference frame using any preferred image-based technique (e.g., Diffusion Models, physics-based renderers); and (2) a fine-tuned stable video diffusion (SVD) model seamlessly propagates this target illumination throughout the sequence. To ensure temporal coherence and prevent artifacts, we introduce a gated cross-attention mechanism for smooth feature blending and a temporal bootstrapping strategy that harnesses SVD's powerful motion priors. Although trained on synthetic data, ReLumix shows competitive generalization to real-world videos. The method demonstrates significant improvements in visual fidelity, offering a scalable and versatile solution for dynamic lighting control.

Authors:Arshia Yousefi Nezhad, Helia Aghaei, Hedieh Sajedi
Title: PVTAdpNet: Polyp Segmentation using Pyramid vision transformer with a novel Adapter block
Abstract:
Colorectal cancer ranks among the most common and deadly cancers, emphasizing the need for effective early detection and treatment. To address the limitations of traditional colonoscopy, including high miss rates due to polyp variability, we introduce the Pyramid Vision Transformer Adapter Residual Network (PVTAdpNet). This model integrates a U-Net-style encoder-decoder structure with a Pyramid Vision Transformer backbone, novel residual blocks, and adapter-based skip connections. The design enhances feature extraction, dense prediction, and gradient flow, supported by squeeze-and-excitation attention for improved channel-wise feature refinement. PVTAdpNet achieves real-time, accurate polyp segmentation, demonstrating superior performance on benchmark datasets with high mDice and mIoU scores, making it highly suitable for clinical applications. PVTAdpNet obtains a high Dice coefficient of 0.8851 and a mean Intersection over Union (mIoU) of 0.8167 on out-of-distribution polyp datasets. Evaluation of the PolypGen dataset demonstrates PVTAdpNet's capability for real-time, accurate performance within familiar distributions. The source code of our network is available at https://github.com/ayousefinejad/PVTAdpNet.git
中文:PVTAdpNet模型通过融合金字塔视觉Transformer与适配器连接,实现了结直肠癌检测中息肉的实时精准分割,在基准数据集上展现出卓越性能。
English: PVTAdpNet, a novel model combining Pyramid Vision Transformer with adapter-based connections, achieves real-time, accurate polyp segmentation for colorectal cancer detection, demonstrating superior performance on benchmark datasets.

Authors:Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang
Title: GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data
Abstract:
To effectively handle clustering task for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton" -- a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.
Chinese: GBSK算法采用粒球技术提出了一种可扩展的骨架聚类方法,通过降低计算成本高效处理大规模数据集并保持高精度,其自适应版本AGBSK简化了参数设置以提升实际应用性。
English: The GBSK algorithm introduces a scalable skeleton clustering method using granular-ball technology to efficiently process large-scale datasets by reducing computational costs while maintaining high accuracy, with an adaptive version AGBSK simplifying parameter settings for practical use.

Authors:Xincheng Yao, Chao Shi, Muming Zhao, Guangtao Zhai, Chongyang Zhang
Title: ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning
Abstract:
This paper explores the problem of class-agnostic anomaly detection (AD), where the objective is to train one class-agnostic AD model that can generalize to detect anomalies in diverse new classes from different domains without any retraining or fine-tuning on the target data. When applied to new classes, the performance of current single- and multi-class AD methods is still unsatisfactory. One fundamental reason is that representation learning in existing methods is still class-related, namely, feature correlation. To address this issue, we propose residual features and construct a simple but effective framework, termed ResAD. Our core insight is to learn the residual feature distribution rather than the initial feature distribution. Residual features are formed by matching and then subtracting normal reference features. In this way, we can effectively realize feature decorrelation. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. In addition, we think that residual features still have one issue: scale correlation. To this end, we propose a feature hypersphere constraining approach, which learns to constrain initial normal residual features into a spatial hypersphere to make the feature scales of different classes as consistent as possible. Furthermore, we propose a novel log-barrier bidirectional contraction OCC loss and a vector-quantization-based feature distribution matching module to enhance ResAD, leading to the improved version of ResAD (ResAD++). Comprehensive experiments on eight real-world AD datasets demonstrate that our ResAD++ can achieve remarkable AD results when directly used in new classes, outperforming state-of-the-art competing methods and also surpassing ResAD. The code is available at https://github.com/xcyao00/ResAD.
中文: 本文提出了ResAD及其改进版ResAD++,通过残差特征和特征超球面约束方法实现类别无关的异常检测,有效解耦特征相关性并在新类别中保持稳定分布,在多个数据集上显著优于现有方法。
English: This paper introduces ResAD and its enhanced version ResAD++, which utilize residual features and a feature hypersphere approach to achieve class-agnostic anomaly detection by decorrelating features and maintaining consistent distributions across new classes, outperforming existing methods on multiple datasets.
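
The residual-feature step described above (match each feature to its nearest normal reference feature, then subtract) is easy to sketch. The snippet below is a minimal illustration under stated assumptions and omits the hypersphere constraint, OCC loss, and vector quantization of ResAD++; the norm-based anomaly score is a simple stand-in:

import torch

def residual_features(feats, reference_bank):
    """feats: (N, D) query features; reference_bank: (M, D) normal reference features.
    Returns (N, D) residual features."""
    d = torch.cdist(feats, reference_bank)        # (N, M) pairwise distances
    nearest = reference_bank[d.argmin(dim=1)]     # best-matching normal feature
    return feats - nearest

def anomaly_score(residuals):
    return residuals.norm(dim=1)                  # larger residual -> more anomalous

bank = torch.randn(500, 256)                      # e.g., features from normal samples
scores = anomaly_score(residual_features(torch.randn(10, 256), bank))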

Authors:Hangtian Zhao, Xiang Chen, Yizhe Li, Qianhao Wang, Haibo Lu, Fei Gao
Title: FastViDAR: Real-Time Omnidirectional Depth Estimation via Alternative Hierarchical Attention
Abstract:
In this paper we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full $360^\circ$ depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce Alternative Hierarchical Attention (AHA) mechanism that efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel ERP fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware. Project page: https://3f7dfc.github.io/FastVidar/

Authors:Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang
Title: M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Abstract:
In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using a text-conditioned diffusion model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis.

Authors:Yunjiang Xu, Lingzhi Li, Jin Wang, Yupeng Ouyang, Benyuan Yang
Title: INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
Abstract:
Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous work has shown that query-based instance-level interaction reduces bandwidth demands and reliance on manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method achieves accuracy improvements of 13.23%/33.08% on DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 of that of state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.
中文:INSTINCT框架通过质量感知过滤和双分支路由的实例级交互机制,在显著提升检测精度的同时将通信带宽降至现有最优方法的1/281至1/264。
English: The proposed INSTINCT framework enhances collaborative perception by employing instance-level interaction with quality-aware filtering and dual-branch routing, achieving significant accuracy improvements and drastic bandwidth reduction compared to state-of-the-art methods.

Authors:Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Title: QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
Abstract:
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose QuantSparse, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce Multi-Scale Salient Attention Distillation, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a 3.68$\times$ reduction in storage and 1.88$\times$ acceleration in end-to-end inference. Our code will be released at https://github.com/wlfeng0509/QuantSparse.
中文摘要:QuantSparse是一个统一压缩框架,通过多尺度注意力蒸馏和二阶稀疏重参数化技术,有效结合模型量化和注意力稀疏化,在显著降低存储和加速推理的同时实现了更优的视频生成质量。
English Summary: QuantSparse is a unified compression framework that effectively combines model quantization and attention sparsification through multi-scale attention distillation and second-order sparse reparameterization, achieving superior video generation quality with significantly reduced storage and accelerated inference.

Authors:Dayu Tan, Ziwei Zhang, Yansan Su, Xin Peng, Yike Dai, Chunhou Zheng, Weimin Zhong
Title: MSD-KMamba: Bidirectional Spatial-Aware Multi-Modal 3D Brain Segmentation via Multi-scale Self-Distilled Fusion Strategy
Abstract:
Numerous CNN-Transformer hybrid models rely on high-complexity global attention mechanisms to capture long-range dependencies, which introduces non-linear computational complexity and leads to significant resource consumption. Although knowledge distillation and sparse attention mechanisms can improve efficiency, they often fall short of delivering the high segmentation accuracy necessary for complex tasks. Balancing model performance with computational efficiency remains a critical challenge. In this work, we propose a novel 3D multi-modal image segmentation framework, termed MSD-KMamba, which integrates bidirectional spatial perception with multi-scale self-distillation. The bidirectional spatial aware branch effectively captures long-range spatial context dependencies across brain regions, while also incorporating a powerful nonlinear feature extraction mechanism that further enhances the model's ability to learn complex and heterogeneous patterns. In addition, the proposed multi-scale self-distilled fusion strategy strengthens hierarchical feature representations and improves the transfer of semantic information at different resolution levels. By jointly leveraging the bidirectional spatial perception branch and the multi-scale self-distilled fusion strategy, our framework effectively mitigates the bottleneck of quadratic computational complexity in volumetric segmentation, while simultaneously addressing the limitation of insufficient global perception. Extensive experiments on multiple standard benchmark datasets demonstrate that MSD-KMamba consistently outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization, while maintaining high computational efficiency and favorable scalability. The source code of MSD-KMamba is publicly available at https://github.com/daimao-zhang/MSD-KMamba.
中文: MSD-KMamba框架通过双向空间感知和多尺度自蒸馏技术,在三维多模态图像分割中有效捕获长程依赖并增强特征表示,相比现有方法实现了更高的分割精度和计算效率。
English: The MSD-KMamba framework introduces bidirectional spatial perception and multi-scale self-distillation to effectively capture long-range dependencies and enhance feature representation in 3D multi-modal image segmentation, achieving superior accuracy and efficiency compared to existing methods.

Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi, Qingming Huang
Title: LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders
Abstract:
This paper explores LightFair, a novel lightweight approach to achieving fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burdens and deliver unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder's neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To fine-tune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.
中文摘要:本文提出LightFair轻量方法,通过距离约束去偏策略微调文本嵌入并结合两阶段采样,有效提升文本到图像扩散模型的公平性,以仅四分之一训练负担实现最优去偏效果且几乎不增加采样成本。
English Summary: This paper introduces LightFair, a lightweight method that enhances fairness in text-to-image diffusion models by fine-tuning text embeddings with a distance-constrained debiasing strategy and a two-stage sampling approach, achieving state-of-the-art performance with significantly reduced training and sampling overhead.
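
One way to picture the distance-constrained debiasing objective described above: push a trainable neutral text embedding to be equidistant from the embeddings of its sensitive-attribute variants in CLIP space, without any auxiliary network. The sketch below is an illustrative assumption, not the exact loss from the paper; the attribute prompts, the squared-difference penalty, and the optimizer settings are made up for the example:

import torch

def distance_balance_loss(neutral_emb, attribute_embs):
    """neutral_emb: (D,) trainable embedding of a neutral prompt (e.g., 'a doctor');
    attribute_embs: (K, D) fixed embeddings of attribute-specific variants."""
    d = torch.cdist(neutral_emb[None], attribute_embs)[0]     # (K,) distances
    return ((d[:, None] - d[None, :]) ** 2).sum() / 2         # penalize distance imbalance

neutral = torch.randn(512, requires_grad=True)
attrs = torch.nn.functional.normalize(torch.randn(4, 512), dim=1)
opt = torch.optim.Adam([neutral], lr=1e-3)
for _ in range(100):                                           # tiny debiasing loop
    opt.zero_grad()
    loss = distance_balance_loss(neutral, attrs)
    loss.backward()
    opt.step()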

Authors:Cheng Huang, Weizheng Xie, Fan Gao, Yutong Liu, Ruoling Wu, Zeyu Han, Jingxi Qiu, Xiangxiang Wang, Zhenglin Yang, Hao Wang, Yongbin Yu
Title: BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images
Abstract:
Structural changes in retinal blood vessels are critical biomarkers for the onset and progression of glaucoma and other ocular diseases. However, current vessel segmentation approaches largely rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography. Here we present BioVessel-Net, an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy. Unlike pixel-based methods, BioVessel-Net directly models vascular structures with biostatistical coherence, achieving accurate and explainable vessel extraction without labeled data or high-performance computing. To support training and evaluation, we introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images with high-resolution vessel details from diverse populations. Experimental results demonstrate that BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods. Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction. Code and dataset are available: https://github.com/VikiXie/SatMar8.
中文摘要:BioVessel-Net是一个无监督生成框架,通过整合血管生物统计学与对抗性优化,无需人工标注即可实现精准的视网膜血管分割,其性能超越现有监督方法且具备临床可解释性。
English Summary: BioVessel-Net is an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement to achieve accurate retinal vessel segmentation without manual annotations, outperforming existing supervised methods while providing clinical interpretability.

Authors:Xiang Tang, Ruotong Li, Xiaopeng Fan
Title: ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing
Abstract:
In the field of 3D content generation, single image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose a novel system ZeroScene, which leverages the prior knowledge of large vision models to accomplish both single image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from input images to infer spatial relationships within the scene. It then jointly optimizes 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts.

Authors:Han Hu, Zhuoran Zheng, Liang Li, Chen Lyu
Title: VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration
Abstract:
Recent Mamba-based image restoration methods have achieved promising results but remain limited by fixed scanning patterns and inefficient feature utilization. Conventional Mamba architectures rely on predetermined paths that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency. To overcome these limitations, we propose VAMamba, a Visual Adaptive Mamba framework with two key innovations. First, QCLAM (Queue-based Cache Low-rank Adaptive Memory) enhances feature learning through a FIFO cache that stores historical representations. Similarity between current LoRA-adapted and cached features guides intelligent fusion, enabling dynamic reuse while effectively controlling memory growth. Second, GPS-SS2D (Greedy Path Scan SS2D) introduces adaptive scanning. A Vision Transformer generates score maps to estimate pixel importance, and a greedy strategy determines optimal forward and backward scanning paths. These learned trajectories replace rigid patterns, enabling SS2D to perform targeted feature extraction. The integration of QCLAM and GPS-SS2D allows VAMamba to adaptively focus on degraded regions while maintaining high computational efficiency. Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration. Our code is available at https://github.com/WaterHQH/VAMamba.
中文摘要:VAMamba通过QCLAM实现动态特征复用和GPS-SS2D自适应扫描路径,在多种图像修复任务中实现了卓越的修复质量与计算效率。
English Summary: VAMamba introduces QCLAM for dynamic feature reuse and GPS-SS2D for adaptive scanning paths, achieving superior image restoration quality and efficiency across diverse tasks.
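
The queue-based cache with similarity-guided fusion described above can be illustrated with a bounded FIFO of feature maps whose best-matching entry is blended into the current features. This is a hedged sketch in the spirit of QCLAM, not the released code; the cosine-similarity gate, fusion rule, and cache size are assumptions:

import collections
import torch
import torch.nn.functional as F

class FeatureCache:
    def __init__(self, max_entries=8):
        self.queue = collections.deque(maxlen=max_entries)    # FIFO bounds memory growth

    def fuse(self, feat):
        """feat: (C, H, W). Returns features blended with the most similar cached entry."""
        if self.queue:
            flat = feat.flatten()
            sims = torch.stack([F.cosine_similarity(flat, f.flatten(), dim=0)
                                for f in self.queue])
            best = self.queue[int(sims.argmax())]
            gate = sims.max().clamp(min=0.0)                   # reuse only when similar
            feat = (1 - gate) * feat + gate * best
        self.queue.append(feat.detach())                       # store for later reuse
        return feat

cache = FeatureCache()
out = cache.fuse(torch.randn(64, 32, 32))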

Authors:Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang
Title: RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
Abstract:
Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configurations. To the best of our knowledge, RobuQ is the first to achieve stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to an average of 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ.
中文摘要:RobuQ框架通过引入RobustQuantizer和AMPN方法,解决了扩散变换器中激活量化的关键难题,在低于4比特的量化配置下实现了最先进的图像生成性能。
English Summary: The RobuQ framework addresses activation quantization challenges in Diffusion Transformers (DiTs) by introducing RobustQuantizer and AMPN pipeline, achieving state-of-the-art sub-4-bit quantization performance with competitive image generation quality.
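
The abstract motivates rotating activations with a Hadamard transform so per-token distributions become closer to normal before low-bit quantization. A small, self-contained sketch of that rotate-then-quantize idea follows; the 2-bit uniform fake quantizer is a simple stand-in and is not the paper's RobustQuantizer:

import torch

def hadamard(n):
    """Sylvester construction; n must be a power of two. Rows are orthonormal."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5

def fake_quant(x, bits=2):
    qmax = 2 ** (bits - 1) - 1                          # symmetric signed grid
    scale = x.abs().amax(dim=-1, keepdim=True) / max(qmax, 1)
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def rotate_and_quantize(acts, bits=2):
    """acts: (tokens, channels) with channels a power of two."""
    H = hadamard(acts.shape[-1])
    return fake_quant(acts @ H.t(), bits) @ H           # quantize in the rotated basis

x = torch.randn(16, 128)
print((rotate_and_quantize(x) - x).abs().mean())        # reconstruction error after rotation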

Authors:Junyi Wu, Jiachen Tao, Haoxuan Wang, Gaowen Liu, Ramana Rao Kompella, Yan Yan
Title: Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos
Abstract:
We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various motion anchors, such as graph nodes or spline control points, they often rely on low-rank assumptions and fall short in modeling complex, region-specific deformations inherent to unconstrained dynamics. OriGS addresses this by introducing a hyperdimensional representation grounded in scene orientation. We first estimate a Global Orientation Field that propagates principal forward directions across space and time, serving as stable structural guidance for dynamic modeling. Built upon this, we propose Orientation-aware Hyper-Gaussian, a unified formulation that embeds time, space, geometry, and orientation into a coherent probabilistic state. This enables inferring region-specific deformation through principled conditioned slicing, adaptively capturing diverse local dynamics in alignment with global motion intent. Experiments demonstrate the superior reconstruction fidelity of OriGS over mainstream methods in challenging real-world dynamic scenes.
中文: OriGS通过基于场景方向的高维表示,解决了单目视频4D重建中复杂区域特定形变的建模难题,在动态场景中实现了卓越的还原精度。
English: OriGS introduces a hyperdimensional representation based on scene orientation to model complex, region-specific deformations in 4D reconstruction from monocular videos, achieving superior fidelity in dynamic scenes.

Authors:Mohammad Hossein Sameti, Amir M. Mansourian, Arash Marioriyad, Soheil Fadaee Oshyani, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Title: No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation
Abstract:
Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most prior approaches, which rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling the large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind
中文: 本文提出了一种细粒度的测试时优化框架,通过将提示分解为语义概念并利用迭代反馈来提高概念覆盖率和生成忠实度,在文本到图像生成任务中显著优于现有方法。
English: This paper introduces a fine-grained test-time optimization framework that enhances text-to-image generation by decomposing prompts into semantic concepts and using iterative feedback to improve concept coverage and faithfulness, outperforming existing methods.
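
The iterative test-time loop described above can be summarized in a short driver function. The sketch below is illustrative only: generate_image, clip_concept_score, llm_rewrite, and split_into_concepts are hypothetical callables standing in for the T2I model, the fine-grained CLIP scorer, the LLM rewriter, and the prompt decomposer, and the threshold is an assumption:

def test_time_refine(prompt, generate_image, clip_concept_score, llm_rewrite,
                     split_into_concepts, rounds=3, threshold=0.25):
    best_img, best_score = None, float("-inf")
    concepts = split_into_concepts(prompt)                 # e.g., objects and attributes
    current = prompt
    for _ in range(rounds):
        img = generate_image(current)
        scores = {c: clip_concept_score(img, c) for c in concepts}
        total = sum(scores.values()) / len(scores)
        if total > best_score:
            best_img, best_score = img, total
        missing = [c for c, s in scores.items() if s < threshold]
        if not missing:
            break
        current = llm_rewrite(prompt, missing)             # emphasize the weak concepts
    return best_img, best_score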

Authors:Tharindu Ekanayake, Constantino Álvarez Casado, Miguel Bordallo López
Title: 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras
Abstract:
Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.
中文:3DPCNet是一个紧凑型模块,通过自监督旋转对齐将视角依赖的3D姿态转换为统一的身体坐标系表示,将旋转误差从超过20°降至3.4°,位置误差降至47毫米,显著提升了运动分析的准确性。
English: 3DPCNet is a compact module that converts view-dependent 3D poses into consistent body-centered representations through self-supervised rotation alignment, significantly improving motion analysis accuracy by reducing rotation errors from over 20° to 3.4° and position errors to 47 mm.
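
The abstract relies on mapping a predicted continuous 6D rotation to a valid SO(3) matrix; the standard Gram-Schmidt construction of Zhou et al. does exactly this. A minimal sketch of that mapping and of applying it to canonicalize a camera-centered skeleton (the network predicting the 6D vector is omitted, and rot6d here is just a placeholder tensor):

import torch
import torch.nn.functional as F

def rotation_from_6d(rot6d):
    """rot6d: (B, 6) -> (B, 3, 3) valid rotation matrices in SO(3)."""
    a1, a2 = rot6d[:, :3], rot6d[:, 3:]
    b1 = F.normalize(a1, dim=1)
    b2 = F.normalize(a2 - (b1 * a2).sum(dim=1, keepdim=True) * b1, dim=1)
    b3 = torch.cross(b1, b2, dim=1)
    return torch.stack([b1, b2, b3], dim=2)

def canonicalize(joints, rot6d):
    """joints: (B, J, 3) camera-centered pose; returns a body-centered pose."""
    R = rotation_from_6d(rot6d)
    centered = joints - joints.mean(dim=1, keepdim=True)    # center on the skeleton centroid
    return torch.einsum("bij,bkj->bki", R, centered)

pose = canonicalize(torch.randn(2, 17, 3), torch.randn(2, 6))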

Authors:Sahithya Ravi, Aditya Chinchure, Raymond T. Ng, Leonid Sigal, Vered Shwartz
Title: SPIKE-RL: Video-LLMs meet Bayesian Surprise
Abstract:
Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where new visual evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, strongly correlated with humans on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses based on a reward signal from the video caption. SPIKE and SPIKE-RL guide query-agnostic surprise-weighted frame sampling, which allocates more frames to interesting moments in the video. With this strategy, we achieve consistent performance gains on five downstream benchmarks over uniform sampling. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.
中文摘要:SPIKE是一种推理时框架,通过量化贝叶斯惊喜来定位视频中的意外时刻,其引导的帧采样策略能在五个下游基准测试中持续超越均匀采样性能,使视频大语言模型能更好地捕捉关键叙事节点。
English Summary: SPIKE is an inference-time framework that identifies surprising moments in videos by measuring Bayesian Surprise, enabling optimized frame sampling that improves performance across multiple benchmarks by focusing on key narrative events.
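
Two ingredients from the abstract are easy to illustrate: Bayesian surprise as the KL divergence between the belief after and before a new frame, and a sampler that allocates the frame budget in proportion to that surprise. The sketch below assumes beliefs are represented as a categorical distribution over hypotheses, which is an illustrative simplification, not the paper's belief model:

import numpy as np

def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) over a discrete set of belief hypotheses."""
    p, q = np.asarray(posterior) + eps, np.asarray(prior) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def surprise_weighted_sample(surprise_per_frame, budget):
    """Pick `budget` frame indices with probability proportional to surprise."""
    s = np.asarray(surprise_per_frame) + 1e-6
    probs = s / s.sum()
    idx = np.random.choice(len(s), size=min(budget, len(s)), replace=False, p=probs)
    return np.sort(idx)

surprise = [bayesian_surprise([0.5, 0.3, 0.2], post)
            for post in ([0.5, 0.3, 0.2], [0.1, 0.1, 0.8], [0.4, 0.4, 0.2])]
print(surprise_weighted_sample(surprise, budget=2))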

Authors:Takehiko Ohkawa, Jihyun Lee, Shunsuke Saito, Jason Saragih, Fabian Prado, Yichen Xu, Shoou-I Yu, Ryosuke Furuta, Yoichi Sato, Takaaki Shiratori
Title: Generative Modeling of Shape-Dependent Self-Contact Human Poses
Abstract:
One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its relevance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact poses and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving single-view pose estimation in self-contact.
Chinese Summary: 该研究推出了首个具有精确体型标注的大规模自接触数据集Goliath-SC,并开发了基于体型参数条件化的生成模型,有效提升了自接触姿态分布建模和单视角姿态估计的准确性。
English Summary: The study introduces Goliath-SC, the first extensive self-contact dataset with precise body shapes, and develops a generative model that uses shape conditioning to improve self-contact pose distribution modeling and single-view pose estimation.

Authors:Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao
Title: Graph Your Own Prompt
Abstract:
We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model's predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. [Project website](https://darcyddx.github.io/gcr/) [Code](https://github.com/Darcyddx/graph-prompt)
中文摘要:图一致性正则化(GCR)是一种创新框架,通过将特征相似性图与类别感知预测图在多网络层中对齐,无需改变架构即可提升语义结构和泛化能力。
English Summary: Graph Consistency Regularization (GCR) is a novel framework that enhances feature learning by aligning feature similarity graphs with class-aware prediction graphs across network layers, improving semantic structure and generalization without architectural changes.
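
A minimal PyTorch sketch of the batch-level graph alignment the abstract describes: a cosine feature-similarity graph is matched against a softmax-prediction graph masked by a same-predicted-class indicator. The function name gcr_loss and the MSE alignment are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def gcr_loss(features, logits):
    # Feature graph: cosine similarity between samples in the batch.
    f = F.normalize(features.flatten(1), dim=1)            # (B, D)
    feat_graph = f @ f.t()                                  # (B, B)
    # Prediction graph: softmax-output similarity, modulated by an
    # intra-class indicator built from the predicted labels.
    p = F.softmax(logits, dim=1)                            # (B, C)
    pred_graph = p @ p.t()                                  # (B, B)
    pred_labels = logits.argmax(dim=1)
    same_class = (pred_labels[:, None] == pred_labels[None, :]).float()
    masked_pred_graph = pred_graph * same_class
    # Consistency: feature-level relations should mirror prediction relations.
    return F.mse_loss(feat_graph, masked_pred_graph.detach())

# Usage: total = task_loss + weight * gcr_loss(intermediate_feats, final_logits);
# the consistency layer itself introduces no trainable parameters.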

Authors:Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang, Tianwen Jiang, Mingyang Chen, Yu Tang, Qiuyong Xiao, Jihong Zhang, Zhixun Su
Title: GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval
Abstract:
The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate rewrites that optimally fit the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences -- including length, multilingual, and modality shifts -- by transforming queries into forms better aligned with the retriever's training distribution. However, our preliminary experiment finds that naively finetuning an LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation. Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts -- including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) -- achieving an average improvement of 4.9% in Recall@10. The code is available at https://github.com/Chinese0123456/GRAPE.git
Chinese Summary: GRAPE提出了一种即插即用的增强方法,通过引入排序感知优化来提升CLIP模型在分布差异场景下的检索性能,在多语言、长文本和多模态任务中均实现了显著改进而无需重新训练。
English Summary: GRAPE introduces a plug-and-play enhancement that uses ranking-aware optimization to improve CLIP-based retrieval performance under distribution shifts, achieving significant gains across multilingual, length, and modality variations without retraining.
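
The corpus-relative, ranking-based reward can be read as scoring a rewritten query by where the ground-truth item lands in the retrieval ranking instead of by its raw similarity, which is the quantity prone to inflation. The sketch below, with hypothetical names ranking_reward and group_advantages, assumes pre-normalized embeddings and a GRPO-style within-group normalization; the paper's exact reward may differ.

import numpy as np

def ranking_reward(query_emb, corpus_embs, target_idx):
    # Rank-based reward: where does the ground-truth item land when the
    # corpus is ranked by similarity to the rewritten query?
    sims = corpus_embs @ query_emb                    # embeddings assumed L2-normalized
    rank = int((sims > sims[target_idx]).sum()) + 1   # 1 = best
    return 1.0 / rank                                 # reciprocal-rank-style reward

def group_advantages(rewards):
    # GRPO-style normalization over a group of rewrites of the same query.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)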

Authors:Siheng Wang, Zhengdao Li, Yanshu Li, Canran Xiao, Haibo Zhan, Zhengtao Yao, Xuzhi Zhang, Jiale Kang, Linshan Li, Weiming Liu, Zhikang Dong, Jifeng Shen, Junhao Dong, Qiang Sun, Piotr Koniusz
Title: C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection
Abstract:
Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages a vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose C3-OWD, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage 1 enhances robustness by pretraining with RGBT data, while Stage 2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between the two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves 80.1 AP50 on FLIR, 48.6 AP50 on novel classes of OV-COCO, and 35.7 mAPr on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
中文:提出的C3-OWD框架通过课程式跨模态对比学习,将目标检测的鲁棒性与泛化能力相统一,在防止灾难性遗忘的同时,在多个基准测试中实现了优越性能。
English: The proposed C3-OWD framework unifies robustness and generalization in object detection through curriculum cross-modal contrastive learning, achieving competitive performance across diverse benchmarks while preventing catastrophic forgetting via an EMA mechanism.
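
The EMA mechanism that bridges the two training stages amounts to maintaining an exponentially averaged copy of the parameters and anchoring the second stage to it. A generic sketch, not the authors' exact schedule or regularizer:

import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Exponential moving average of parameters; anchoring Stage 2 to these
    # averaged weights bounds the drift away from Stage 1 behaviour.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical use: ema_model = copy.deepcopy(model), then call ema_update(...)
# after every optimizer step of the second training stage.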

Authors:Hao Liu, Yongjie Zheng, Yuhan Kang, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone
Title: Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification
Abstract:
Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.
中文: 本文提出了一种平衡扩散引导融合(BDGF)框架,通过自适应模态掩码策略和扩散特征引导多分支网络,解决了多模态遥感数据中的模态不平衡问题,并在土地覆盖分类中实现了优越性能。
English: This paper introduces a balanced diffusion-guided fusion (BDGF) framework that addresses modality imbalance in multimodal remote sensing data by using diffusion features to guide multi-branch networks, achieving superior land-cover classification through adaptive masking and mutual learning strategies.

Authors:Minsun Jeon, Simon S. Woo
Title: Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection
Abstract:
The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and demonstrate that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that naturally occurs in camera-captured images due to lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure. This makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: https://github.com/irissun9602/Defocus-Deepfake-Detection
中文: 该框架利用散焦模糊(相机拍摄图像中的自然光学现象)作为可物理解读的取证信号来检测AI生成的合成内容,并通过特征分析和实验验证证明了其鲁棒性。
English: The proposed framework leverages defocus blur—a natural optical phenomenon in camera-captured images—as a physically interpretable forensic signal to detect AI-generated synthetic content, demonstrating robustness through feature analyses and experimental validation.
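
One simple way to obtain a defocus-blur map of the kind described above is the local variance of a Laplacian response, which is high in sharp (in-focus) regions and low in defocused ones. This is a generic blur-estimation sketch, not the paper's specific estimator:

import numpy as np
from scipy.ndimage import laplace, uniform_filter

def defocus_blur_map(gray, win=15):
    # Local variance of the Laplacian: strong for sharp regions, weak for
    # defocused ones, so the map loosely encodes depth-of-field structure.
    gray = gray.astype(np.float64)
    lap = laplace(gray)
    local_mean = uniform_filter(lap, win)
    local_var = uniform_filter(lap ** 2, win) - local_mean ** 2
    m = np.sqrt(np.clip(local_var, 0.0, None))
    return m / (m.max() + 1e-8)   # normalized map used as a forensic feature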

Authors:Atakan Topaloglu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, Federico Tombari
Title: OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting
Abstract:
Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.
Chinese Summary: OracleGS是一种新颖框架,通过扩散模型生成完整场景并利用多视角立体模型验证几何不确定性,将生成完整性与回归保真度相结合,指导3D高斯溅射优化,在稀疏视角新视图合成中超越了现有最优方法。
English Summary: OracleGS is a novel framework that integrates generative completeness with regressive fidelity for sparse-view novel view synthesis by using a diffusion model to propose complete scenes and a multi-view stereo model to validate geometric uncertainties, guiding 3D Gaussian Splatting optimization to outperform state-of-the-art methods.
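
The uncertainty-weighted loss can be sketched as a per-pixel confidence weighting of the photometric error against the generated views; the linear weighting below is an assumption, since the paper derives its uncertainty from MVS attention maps rather than a given map.

import torch

def uncertainty_weighted_l1(rendered, generated_view, uncertainty):
    # Down-weight pixels where the oracle flags the generated view as
    # unreliable, so hallucinated regions contribute less to optimization.
    confidence = 1.0 - uncertainty.clamp(0.0, 1.0)
    return (confidence * (rendered - generated_view).abs()).mean()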

Authors:Sasan Sharifipour, Constantino Álvarez Casado, Le Nguyen, Tharindu Ekanayake, Manuel Lage Cañellas, Nhi Nguyen, Miguel Bordallo López
Title: LiDAR-based Human Activity Recognition through Laplacian Spectral Analysis
Abstract:
Human Activity Recognition supports applications in healthcare, manufacturing, and human-machine interaction. LiDAR point clouds offer a privacy-preserving alternative to cameras and are robust to illumination. We propose a HAR method based on graph spectral analysis. Each LiDAR frame is mapped to a proximity graph (epsilon-graph) and the Laplacian spectrum is computed. Eigenvalues and statistics of eigenvectors form pose descriptors, and temporal statistics over sliding windows yield fixed vectors for classification with support vector machines and random forests. On the MM-Fi dataset with 40 subjects and 27 activities, under a strict subject-independent protocol, the method reaches 94.4% accuracy on a 13-class rehabilitation set and 90.3% on all 27 activities. It also surpasses the skeleton-based baselines reported for MM-Fi. The contribution is a compact and interpretable feature set derived directly from point cloud geometry that provides an accurate and efficient alternative to end-to-end deep learning.
中文: 本研究提出一种基于LiDAR点云的人类活动识别方法,通过构建邻近图并分析其拉普拉斯谱来生成姿态描述符,在MM-Fi数据集上准确率超过90%,同时提供了保护隐私且可解释的深度学习替代方案。
English: This study introduces a human activity recognition method using LiDAR point clouds, which constructs proximity graphs and analyzes their Laplacian spectra to create pose descriptors, achieving over 90% accuracy on the MM-Fi dataset while offering a privacy-preserving and interpretable alternative to deep learning approaches.
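
A compact sketch of the described pipeline with NumPy/SciPy: each frame becomes an epsilon-graph, its Laplacian spectrum forms a pose descriptor, and sliding-window statistics yield fixed-length vectors for a standard classifier. Thresholds and feature choices here are illustrative, not the paper's tuned values.

import numpy as np
from scipy.spatial.distance import cdist

def frame_descriptor(points, eps=0.3, k_eigs=16):
    # Epsilon-graph Laplacian spectrum of one LiDAR frame (N x 3 points).
    d = cdist(points, points)
    adj = (d < eps).astype(np.float64)
    np.fill_diagonal(adj, 0.0)
    lap = np.diag(adj.sum(axis=1)) - adj
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    eigvals = np.pad(eigvals, (0, max(0, k_eigs - len(eigvals))))[:k_eigs]
    return np.concatenate([eigvals, [eigvals.mean(), eigvals.std()]])

def window_features(frames, eps=0.3):
    # Temporal statistics of per-frame descriptors over a sliding window.
    descs = np.stack([frame_descriptor(f, eps) for f in frames])
    return np.concatenate([descs.mean(0), descs.std(0), descs.min(0), descs.max(0)])

# The fixed-length window vectors are then classified with, e.g.,
# sklearn.svm.SVC(kernel="rbf") or a random forest.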

Authors:Donghao Zhang, Yimin Chen, Kauê TN Duarte, Taha Aslan, Mohamed AlShamrani, Brij Karmur, Yan Wan, Shengcai Chen, Bo Hu, Bijoy K Menon, Wu Qiu
Title: Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT
Abstract:
Non-contrast computed tomography (NCCT) is essential for rapid stroke diagnosis but is limited by low image contrast and signal to noise ratio. We address this challenge by leveraging DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for a comprehensive set of stroke analysis tasks. Our evaluation encompasses infarct and hemorrhage segmentation, anomaly classification (normal vs. stroke and normal vs. infarct vs. hemorrhage), hemorrhage subtype classification (EDH, SDH, SAH, IPH, IVH), and dichotomized ASPECTS classification (<=6 vs. >6) on multiple public and private datasets. This study establishes strong benchmarks for these tasks and demonstrates the potential of advanced self-supervised models to improve automated stroke diagnosis from NCCT, providing a clear analysis of both the advantages and current constraints of the approach. The code is available at https://github.com/Zzz0251/DINOv3-stroke.
中文: 本研究利用DINOv3自监督视觉变换器改进非增强CT的卒中分析任务,建立了强大基准并展示了其在自动化诊断中的潜力,同时指出了当前方法的局限性。
English: This study utilizes the DINOv3 self-supervised vision transformer to enhance stroke analysis tasks on non-contrast CT, establishing strong benchmarks and demonstrating its potential for automated diagnosis while acknowledging current limitations.

Authors:Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, Yue Ma
Title: Follow-Your-Preference: Towards Preference-Aligned Image Inpainting
Abstract:
This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, render them susceptible to cause reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.
中文: 本研究通过应用直接偏好优化与公共奖励模型重新审视图像修复中的偏好对齐,揭示了奖励模型的偏差并证明简单的集成方法能有效缓解这些问题,从而在多项评估中实现更优性能。
English: This study revisits image inpainting preference alignment by applying direct preference optimization with public reward models, revealing their biases and demonstrating that a simple ensemble approach effectively mitigates these issues to achieve superior performance across multiple evaluations.

Authors:Ben Liang, Yuan Liu, Bingwen Qiu, Yihong Wang, Xiubao Sui, Qian Chen
Title: FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection
Abstract:
Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.
Chinese: FMC-DETR提出了一种具有频率解耦融合的新型框架和Wavelet Kolmogorov-Arnold Transformer骨干网络,以增强全局上下文感知和自适应非线性建模,在航空视角微小物体检测中以更少的参数实现了最先进的性能。
English: FMC-DETR introduces a novel framework with frequency-decoupled fusion and a Wavelet Kolmogorov-Arnold Transformer backbone to enhance global context perception and adaptive non-linear modeling, achieving state-of-the-art performance in aerial-view tiny object detection with fewer parameters.

Authors:Ye-eun Kim, Suhyeon Lim, Andrew J. Choi
Title: MMeViT: Multi-Modal ensemble ViT for Post-Stroke Rehabilitation Action Recognition
Abstract:
Rehabilitation therapy for stroke patients faces a supply shortage despite the increasing demand. To address this issue, remote monitoring systems that reduce the burden on medical staff are emerging as a viable alternative. A key component of these remote monitoring systems is Human Action Recognition (HAR) technology, which classifies actions. However, existing HAR studies have primarily focused on non-disabled individuals, making them unsuitable for recognizing the actions of stroke patients. HAR research for stroke has largely concentrated on classifying relatively simple actions using machine learning rather than deep learning. In this study, we designed a system to monitor the actions of stroke patients, focusing on domiciliary upper limb Activities of Daily Living (ADL). Our system utilizes IMU (Inertial Measurement Unit) sensors and an RGB-D camera, which are the most common modalities in HAR. We directly collected a dataset through this system, investigated appropriate preprocessing, and proposed a deep learning model suitable for processing multimodal data. We analyzed the collected dataset and found that the action data of stroke patients is less clearly clustered than that of non-disabled individuals. At the same time, we found that the proposed model learns similar tendencies for each label even in data whose features are difficult to cluster. This study suggests the possibility of extending the deep learning model, which has learned the action features of stroke patients, beyond simple action recognition to feedback such as assessment that contributes to domiciliary rehabilitation in future research. The code presented in this study is available at https://github.com/ye-Kim/MMeViT.
中文: 本研究开发了一种基于惯性测量单元传感器和RGB-D摄像头的多模态深度学习系统,用于识别脑卒中患者的上肢日常活动,解决了现有动作识别模型不适用于该人群的局限性,为远程康复监测提供了潜在应用前景。
English: This study develops a multimodal deep learning system using IMU sensors and an RGB-D camera to recognize upper limb daily activities in stroke patients, addressing the limitations of existing human action recognition models that are unsuitable for this population and enabling potential applications in remote rehabilitation monitoring.

Authors:Gabriel A. Viana, Luis F. Alves Pereira, Tsang Ing Ren, George D. C. Cavalcanti, Jan Sijbers
Title: Perceptual Influence: Improving the Perceptual Loss Design for Low-Dose CT Enhancement
Abstract:
Perceptual losses have emerged as powerful tools for training networks to enhance Low-Dose Computed Tomography (LDCT) images, offering an alternative to traditional pixel-wise losses such as Mean Squared Error, which often lead to over-smoothed reconstructions and loss of clinically relevant details in LDCT images. The perceptual losses operate in a latent feature space defined by a pretrained encoder and aim to preserve semantic content by comparing high-level features rather than raw pixel values. However, the design of perceptual losses involves critical yet underexplored decisions, including the feature representation level, the dataset used to pretrain the encoder, and the relative importance assigned to the perceptual component during optimization. In this work, we introduce the concept of perceptual influence (a metric that quantifies the relative contribution of the perceptual loss term to the total loss) and propose a principled framework to assess the impact of the loss design choices on the model training performance. Through systematic experimentation, we show that the widely used configurations in the literature to set up a perceptual loss underperform compared to better-designed alternatives. Our findings show that better perceptual loss designs lead to significant improvements in noise reduction and structural fidelity of reconstructed CT images, without requiring any changes to the network architecture. We also provide objective guidelines, supported by statistical analysis, to inform the effective use of perceptual losses in LDCT denoising. Our source code is available at https://github.com/vngabriel/perceptual-influence.
中文: 感知损失通过特征空间比较保留语义内容以提升低剂量CT图像重建效果,本研究提出系统性评估框架,证明优化后的损失设计能在不改变网络架构的情况下显著提升噪声抑制与结构保真度。
English: Perceptual losses enhance LDCT image reconstruction by preserving semantic content through feature-level comparisons, and our study introduces a principled framework that demonstrates optimized loss designs significantly improve noise reduction and structural fidelity without altering network architecture.
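
One natural formalization of the "perceptual influence" metric is the fraction of the total training loss contributed by the perceptual term; the paper's exact definition may differ. A sketch assuming an MSE pixel term, a pretrained feature encoder, and a weight lam:

import torch
import torch.nn.functional as F

def loss_with_perceptual_influence(pred, target, encoder, lam=0.1):
    # Pixel + perceptual loss, plus the share of the total loss that the
    # perceptual term contributes ("perceptual influence").
    pixel = F.mse_loss(pred, target)
    with torch.no_grad():
        target_feats = encoder(target)
    perceptual = F.mse_loss(encoder(pred), target_feats)
    total = pixel + lam * perceptual
    influence = (lam * perceptual / total).item()   # value in [0, 1]
    return total, influence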

Authors:Zhiqiang Tian, Weigang Li, Chunhua Deng, Junwei Hu, Yongqiang Wang, Wenping Liu
Title: Desensitizing for Improving Corruption Robustness in Point Cloud Classification through Adversarial Training
Abstract:
Due to scene complexity, sensor inaccuracies, and processing imprecision, point cloud corruption is inevitable. Over-reliance on input features is the root cause of DNN vulnerabilities. It remains unclear whether this issue exists in 3D tasks involving point clouds and whether reducing dependence on these features can enhance the model's robustness to corrupted point clouds. This study attempts to answer these questions. Specifically, we quantified the sensitivity of the DNN to point cloud features using Shapley values and found that models trained using traditional methods exhibited high sensitivity values for certain features. Furthermore, under an equal pruning ratio, prioritizing the pruning of highly sensitive features causes more severe damage to model performance than random pruning. We propose Desensitized Adversarial Training (DesenAT), which generates adversarial samples using feature desensitization and conducts training within a self-distillation framework, aiming to alleviate the DNN's over-reliance on point cloud features by smoothing sensitivity. First, data points with high contribution components are eliminated, and spatial transformation is used to simulate corruption scenes, generate adversarial samples, and conduct adversarial training on the model. Next, to compensate for information loss in adversarial samples, we use the self-distillation method to transfer knowledge from clean samples to adversarial samples, and perform adversarial training in a distillation manner. Extensive experiments on ModelNet-C and PointCloud-C show that the proposed method can effectively improve the robustness of the model without reducing performance on clean datasets. The code is publicly available at https://github.com/JerkyT/DesenAT.
Chinese: 本研究提出去敏感对抗训练(DesenAT),通过生成对抗样本并采用自蒸馏方法,减轻深度神经网络对特定点云特征的过度依赖,从而在保持干净数据集性能的同时,有效提升模型对受损点云的鲁棒性。
English: This study introduces Desensitized Adversarial Training (DesenAT), a method that reduces deep neural networks' over-reliance on specific point cloud features by generating adversarial samples and using self-distillation, thereby enhancing model robustness against corrupted point clouds without compromising performance on clean datasets.
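
The self-distillation step can be sketched as using the model's own predictions on the clean point cloud as a soft teacher for the corrupted/adversarial version of the same sample. Temperature and mixing weight below are illustrative assumptions, not the paper's settings:

import torch
import torch.nn.functional as F

def self_distill_adv_loss(model, clean_pts, adv_pts, labels, T=4.0, alpha=0.5):
    # The model's own predictions on the clean point cloud act as a soft
    # teacher for the adversarial (corrupted) version of the same sample.
    with torch.no_grad():
        teacher = F.softmax(model(clean_pts) / T, dim=1)
    student_logits = model(adv_pts)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1), teacher,
                  reduction="batchmean") * T * T
    return alpha * ce + (1.0 - alpha) * kd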

Authors:Siheng Zhao, Jiageng Mao, Wei Chow, Zeyu Shangguan, Tianheng Shi, Rong Xue, Yuxi Zheng, Yijia Weng, Yang You, Daniel Seita, Leonidas Guibas, Sergey Zakharov, Vitor Guizilini, Yue Wang
Title: Robot Learning from Any Images
Abstract:
We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA's versatility across applications like scalable robotic data generation and augmentation, robot learning from Internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at https://sihengz02.github.io/RoLA .

Authors:Federico Chinello, Giacomo Boracchi
Title: Convolutional Set Transformer
Abstract:
We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
中文摘要:卷积集合变换器(CST)是一种新型神经网络架构,可直接处理三维图像张量组成的异构图像集,通过同步实现特征提取与上下文建模,在集合分类等任务中性能优于现有方法,并保持与CNN可解释性方法的兼容性。
English Summary: The Convolutional Set Transformer (CST) is a novel neural architecture that directly processes heterogeneous image sets as 3D tensors, integrating feature extraction and contextual modeling to outperform existing methods in tasks like set classification while maintaining compatibility with CNN explainability techniques.

Authors:Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen
Title: VideoScore2: Think before You Score in Generative Video Evaluation
Abstract:
Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
Chinese Summary: VideoScore2是一个多维可解释的视频评估框架,通过三阶段维度评估和强化学习训练,在多个基准测试中表现优异,并为可控生成提供可解释的评估依据。
English Summary: VideoScore2 is a multi-dimensional and interpretable framework that evaluates text-to-video generation across visual quality, semantic alignment, and physical consistency, achieving superior performance on benchmarks while providing detailed rationales.

Authors:Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal
Title: DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
Abstract:
Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from limited images for personalization and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. The single trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available at https://github.com/MAXNORM8650/DEFT.
Chinese: DEFT是一种高效微调框架,通过将权重更新分解为两个低秩组件,在个性化学习、任务统一和可编辑性之间实现最佳平衡,并在多个数据集上取得了最先进的性能。
English: DEFT is an efficient fine-tuning framework that decomposes weight updates into two low-rank components, enabling optimal balance between personalization, task unification, and editability while achieving state-of-the-art performance across multiple datasets.

Authors:Fernando Julio Cendra, Kai Han
Title: PartCo: Part-Level Correspondence Priors Enhance Category Discovery
Abstract:
Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed part-level cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, achieving state-of-the-art results by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD. Project page: https://visual-ai.github.io/partco

Authors:Le Zhang, Ao Li, Qibin Hou, Ce Zhu, Yonina C. Eldar
Title: Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects
Abstract:
Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this field, numerous surveys have emerged. Most existing surveys focus on specific domains, lacking a comprehensive overview of this field. Here, we present an in-depth review of diverse SR methods, encompassing single image super-resolution (SISR), video super-resolution (VSR), stereo super-resolution (SSR), and light field super-resolution (LFSR). We extensively cover over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 techniques for SSR and LFSR. We analyze methodologies, datasets, evaluation protocols, empirical results, and complexity. In addition, we provide a taxonomy of backbone structures organized by their design purposes. We also explore valuable yet under-studied open issues in the field. We believe that this work will serve as a valuable resource and offer guidance to researchers in this domain. To facilitate access to related work, we created a dedicated repository available at https://github.com/AVC2-UESTC/Holistic-Super-Resolution-Review.
中文摘要:本综述全面分析了150多种单图像、70种视频及30种立体/光场超分辨率方法,通过方法论比较和未充分研究方向的探讨,为研究人员提供了该领域的重要参考资源。
English Summary: This comprehensive survey provides an in-depth analysis of over 150 single image, 70 video, and 30 stereo/light field super-resolution methods, offering methodological comparisons and identifying under-studied research directions to serve as a key resource for researchers.

Authors:Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham
Title: Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation
Abstract:
Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision. Code is available at https://github.com/hieuphamha19/TGC.
Chinese: 提出的拓扑图一致性(TGC)框架通过施加图论约束来保持全局拓扑结构,从而在半监督语义分割中取得优异表现,在少量监督下达到最先进水平,并显著缩小与全监督之间的差距。
English: The proposed Topology Graph Consistency (TGC) framework enhances semi-supervised semantic segmentation by enforcing graph-theoretic constraints to maintain global topology, achieving state-of-the-art performance with minimal supervision and narrowing the gap to full supervision on medical datasets.
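
A simplified reading of the graph-theoretic consistency idea: summarize each segmentation mask by its connected-component count and the Laplacian spectrum of a component-adjacency graph, then penalize the discrepancy between prediction and reference. The adjacency construction and distance below are assumptions, not the paper's exact terms.

import numpy as np
from scipy import ndimage

def mask_graph_stats(mask, k=8):
    # Topology summary of a binary mask: number of connected components and
    # leading Laplacian eigenvalues of a component-adjacency graph
    # (components whose small dilations overlap are treated as adjacent).
    labels, n = ndimage.label(mask)
    adj = np.zeros((n, n))
    for i in range(1, n + 1):
        grown = ndimage.binary_dilation(labels == i, iterations=2)
        for j in range(i + 1, n + 1):
            if (grown & (labels == j)).any():
                adj[i - 1, j - 1] = adj[j - 1, i - 1] = 1.0
    lap = np.diag(adj.sum(axis=1)) - adj
    eigs = np.sort(np.linalg.eigvalsh(lap)) if n else np.zeros(0)
    eigs = np.pad(eigs, (0, max(0, k - len(eigs))))[:k]
    return n, eigs

def topology_discrepancy(pred_mask, ref_mask):
    n_p, e_p = mask_graph_stats(pred_mask)
    n_r, e_r = mask_graph_stats(ref_mask)
    return abs(n_p - n_r) + float(np.abs(e_p - e_r).sum())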

Authors:Yash Thube
Title: Pathological Truth Bias in Vision-Language Models
Abstract:
Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Instruction tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid to late cross attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.
中文: 视觉语言模型在拒绝视觉矛盾陈述方面存在系统性缺陷,MATS审计通过空间一致性评分和不正确同意率量化了生成模型的不足,并借助激活修补定位了可修复的故障节点。
English: Vision Language Models often fail to reject visually contradicted statements, as revealed by the MATS audit, which identifies systematic weaknesses in generative models and suggests targeted repairs through activation patching.
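
One plausible reading of the two metrics: SCS is the fraction of visually contradicted statements the model rejects, and IAR is the fraction it agrees with. The record format below is hypothetical and only illustrates the bookkeeping:

def mats_scores(records):
    # records: list of dicts with keys "contradicted" (statement conflicts
    # with the image) and "agreed" (model accepted the statement).
    contradicted = [r for r in records if r["contradicted"]]
    if not contradicted:
        return float("nan"), float("nan")
    scs = sum(not r["agreed"] for r in contradicted) / len(contradicted)  # rejections
    iar = sum(r["agreed"] for r in contradicted) / len(contradicted)      # false agreements
    return scs, iar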

Authors:E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
Title: Pixel Motion Diffusion is What We Need for Robot Control
Abstract:
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://nero1342.github.io/DAWN/

Authors:Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li
Title: VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
Abstract:
The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .

Authors:Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
Title: CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Abstract:
Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
中文总结:该研究提出CapRL强化学习框架,通过评估描述能否帮助语言模型准确回答图像相关问题来定义描述质量,从而突破监督微调的限制。
English Summary: The study introduces CapRL, a reinforcement learning framework that overcomes limitations of supervised fine-tuning by defining caption quality through a caption's ability to help language models answer image-related questions accurately.
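
The verifiable reward can be sketched as the multiple-choice accuracy of a text-only LLM that sees only the caption; vision_free_llm below is a hypothetical callable standing in for whatever judge model is used, and the prompt format is illustrative.

def caption_utility_reward(caption, mcqs, vision_free_llm):
    # Reward a caption by how often a text-only LLM answers multiple-choice
    # questions about the image correctly when given only the caption.
    correct = 0
    for q in mcqs:  # each q: {"question": str, "options": [str, ...], "answer": "B"}
        prompt = (f"Caption: {caption}\n"
                  f"Question: {q['question']}\n"
                  f"Options: {', '.join(q['options'])}\n"
                  f"Answer with a single letter.")
        pred = vision_free_llm(prompt).strip()[:1].upper()  # hypothetical judge call
        correct += int(pred == q["answer"])
    return correct / len(mcqs)  # verifiable reward in [0, 1]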

Authors:Alexandre Lopes, Roberto Souza, Helio Pedrini
Title: CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach
Abstract:
Depth Estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation since it only needs to estimate the disparity of pixels in image pairs to determine the depth in a known rectified system. Due to the difficulty in acquiring reliable ground-truth depth data across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18× faster than the current best model and achieves state-of-the-art results in all metrics in the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at https://github.com/alelopes/CCNext.
中文: 提出的CCNeXt架构采用自监督卷积方法,在深度估计任务中超越了现有CNN和ViT模型,以显著提升的计算速度在多个数据集上实现了最优性能。
English: The proposed CCNeXt architecture introduces a self-supervised convolutional approach that surpasses existing CNNs and ViTs in depth estimation, achieving state-of-the-art results with significantly faster computational speed across multiple datasets.

Authors:Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: SPARK: Synergistic Policy And Reward Co-Evolving Framework
Abstract:
Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
中文摘要:SPARK框架通过回收利用训练过程中的数据和正确性信号,协同优化策略与生成式奖励模型,无需依赖昂贵的人工反馈,在多项基准测试中实现了显著的性能提升。
English Summary: The SPARK framework efficiently recycles rollout and correctness data to co-evolve both the policy and a generative reward model, eliminating the need for costly human feedback and achieving significant performance gains across various benchmarks.

Authors:Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
Title: LongLive: Real-time Interactive Long Video Generation
Abstract:
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, abbreviated as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
中文: LongLive是一种用于实时生成长视频的自回归框架,通过KV重缓存和流式长视频调优解决了效率与质量难题,实现了高速生成并支持交互式提示输入。
English: LongLive is an autoregressive framework for real-time long video generation that overcomes efficiency and quality challenges through KV-recache and streaming long tuning, achieving high-speed performance and supporting interactive prompt inputs.
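
A toy sketch of the frame-sink plus short-window idea: keep only the KV entries of the first frames and of a recent sliding window. The flat-list cache layout and frame sizes are assumptions for illustration; the real system operates on transformer KV tensors and recomputes states on prompt switches.

def prune_kv_cache(kv_cache, sink_frames=1, window_frames=8, tokens_per_frame=256):
    # Keep only the KV entries of the first (sink) frames plus a sliding
    # window of the most recent frames; kv_cache is a flat list of per-token
    # entries in generation order.
    keep = (sink_frames + window_frames) * tokens_per_frame
    if len(kv_cache) <= keep:
        return kv_cache
    sink = kv_cache[: sink_frames * tokens_per_frame]
    recent = kv_cache[-window_frames * tokens_per_frame:]
    return sink + recent

# On a prompt switch, a KV-recache pass would re-encode the retained frames
# under the new prompt so cached states stay consistent with the new text.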

Authors:Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, Xing Wei
Title: JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Abstract:
Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Our project page: https://miv-xjtu.github.io/JanusVLN.github.io/.

Authors:Zhenqi He, Yuanpei Liu, Kai Han
Title: Category Discovery: An Open-World Perspective
Abstract:
Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data containing instances from unseen classes, given some labelled data from seen classes. This task has attracted significant attention over the years and has led to a rich body of literature trying to address the problem from different perspectives. In this survey, we provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. Firstly, we introduce a taxonomy for the literature by considering two base settings, namely novel category discovery (NCD) and generalized category discovery (GCD), and several derived settings that are designed to address the extra challenges in different real-world application scenarios, including continual category discovery, skewed data distribution, federated category discovery, etc. Secondly, for each setting, we offer a detailed analysis of the methods encompassing three fundamental components, representation learning, label assignment, and estimation of class number. Thirdly, we benchmark all the methods and distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery, while challenges remain in the design of label assignment, the estimation of class numbers, and scaling to complex multi-object scenarios. Finally, we discuss the key insights from the literature so far and point out promising future research directions. We compile a living survey of the category discovery literature at https://github.com/Visual-AI/Category-Discovery.
中文: 类别发现是一种开放世界学习任务,利用已见类别的标注数据对未见类别的未标注数据进行自动分类,本综述系统梳理了不同设定下的方法体系,通过基准测试指出表征学习和课程式训练等有效策略,同时揭示了类别数量估计等关键挑战与发展方向。
English: Category discovery is an open-world learning task that groups unlabeled data from unseen classes using labeled data from seen classes, with this survey providing a comprehensive taxonomy, method analysis, and benchmarking of approaches while highlighting challenges like class number estimation and future research directions.

Authors:Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
Title: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Abstract:
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
中文: EAGLE是一个轻量级黑盒框架,通过将多模态大语言模型的令牌生成归因于视觉区域并量化语言先验与感知证据的相对影响,显著提升了模型的可解释性,在忠实度和效率上均优于现有方法。
English: EAGLE is a lightweight black-box framework that enhances the interpretability of multimodal large language models by attributing token generation to visual regions and quantifying the influence of language priors versus perceptual evidence, outperforming existing methods in faithfulness and efficiency.
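
As a rough illustration of the greedy, black-box attribution EAGLE describes (combining a sufficiency "insight" score with an indispensability "necessity" score over sparsified image regions), the sketch below is a minimal assumption-laden mock-up: `token_prob` stands in for querying the MLLM with only certain regions visible, and the exact objective and region masking are illustrative, not the authors' implementation.

```python
# Hedged sketch of greedy region attribution in the spirit of EAGLE.
# `token_prob` is a placeholder for a black-box MLLM query; the objective,
# masking scheme, and names are assumptions, not the released code.
import numpy as np

def token_prob(region_mask: np.ndarray) -> float:
    """Placeholder: probability of the selected output tokens when only the
    regions flagged in `region_mask` are visible to the model."""
    rng = np.random.default_rng(int(region_mask.sum()))
    return float(rng.uniform())  # replace with a real MLLM query

def greedy_attribution(num_regions: int, budget: int):
    """Greedily pick regions that maximize a sufficiency + necessity objective."""
    selected = np.zeros(num_regions, dtype=bool)
    full = np.ones(num_regions, dtype=bool)
    for _ in range(budget):
        best_gain, best_idx = -np.inf, None
        for idx in np.flatnonzero(~selected):
            trial = selected.copy()
            trial[idx] = True
            insight = token_prob(trial)                  # sufficiency of the subset
            necessity = 1.0 - token_prob(full & ~trial)  # drop when subset is removed
            gain = insight + necessity
            if gain > best_gain:
                best_gain, best_idx = gain, idx
        selected[best_idx] = True
    return np.flatnonzero(selected)

print(greedy_attribution(num_regions=16, budget=4))
```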

Authors:Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, Feng Zhao
Title: Group Critical-token Policy Optimization for Autoregressive Image Generation
Abstract:
Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens to RLVR training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose Group Critical-token Policy Optimization (GCPO), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: (1) Causal dependency: early tokens fundamentally determine the later tokens and the final image due to the unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions; (3) RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on the confidence divergence between the policy model and the reference model. By leveraging only 30% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
Chinese: 近期研究提出关键令牌组策略优化(GCPO),通过基于因果依赖、熵梯度和令牌多样性识别关键图像令牌进行选择性优化,在仅使用30%令牌的情况下实现了比全令牌优化更优的自回归视觉生成效果。
English: Recent research introduces Group Critical-token Policy Optimization (GCPO), which enhances reinforcement learning for autoregressive visual generation by selectively optimizing critical image tokens identified through causal dependency, entropy gradients, and token diversity, achieving superior performance with only 30% of tokens.
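
To make the token-selection and reweighting idea concrete, the following is a minimal sketch under stated assumptions: the three per-token scores (causal position, entropy, diversity) and the way they are combined, as well as the advantage reweighting rule, are illustrative stand-ins rather than GCPO's exact formulation.

```python
# Hedged sketch of critical-token selection and token-wise advantage weighting
# in the spirit of GCPO; the scoring and weighting rules are assumptions.
import torch

def select_critical_tokens(entropy, position, diversity, keep_ratio=0.3):
    """entropy, position, diversity: (seq_len,) scores per image token."""
    causal = 1.0 - position / position.max()              # earlier tokens matter more
    score = causal + entropy / (entropy.max() + 1e-8) + diversity
    k = max(1, int(keep_ratio * score.numel()))
    return torch.topk(score, k).indices                   # indices of critical tokens

def token_advantages(base_adv, policy_conf, ref_conf, critical_idx):
    """Scale the group advantage on critical tokens by confidence divergence."""
    adv = base_adv.clone()
    divergence = (ref_conf - policy_conf).clamp(min=0.0)  # encourage exploration
    adv[critical_idx] = adv[critical_idx] * (1.0 + divergence[critical_idx])
    return adv

seq_len = 64
entropy, diversity = torch.rand(seq_len), torch.rand(seq_len)
position = torch.arange(seq_len, dtype=torch.float32)
idx = select_critical_tokens(entropy, position, diversity)
adv = token_advantages(torch.full((seq_len,), 0.5), torch.rand(seq_len), torch.rand(seq_len), idx)
print(idx.shape, adv.shape)
```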

Authors:Mishal Fatima, Shashank Agnihotri, Marius Bock, Kanchana Vaishnavi Gandikota, Kristof Van Laerhoven, Michael Moeller, Margret Keuper
Title: γ-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition
Abstract:
Most pattern recognition models are developed on pre-processed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose γ-Quant, i.e. the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4 bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via https://github.com/Mishalfatima/Gamma-Quant
中文摘要:本研究提出γ-Quant方法,通过任务特定的非线性量化学习,仅用4位原始数据即可实现与12位数据相当的模式识别性能,有效解决了计算机视觉和人体活动识别中的能效挑战。
English Summary: This study introduces γ-Quant, a method for learning task-specific non-linear quantization that enables pattern recognition using only 4-bit raw data while matching the performance of 12-bit data, addressing efficiency challenges in computer vision and human activity recognition applications.
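
The core idea, a learnable non-linear quantizer applied before the recognition network, can be sketched as below. This is a minimal illustration assuming a gamma-style power curve with a single trainable exponent and uniform low-bit rounding with a straight-through estimator; the paper's actual parameterization may differ.

```python
# Hedged sketch of a learnable non-linear quantizer in the spirit of γ-Quant.
import torch
import torch.nn as nn

class GammaQuant(nn.Module):
    def __init__(self, bits: int = 4):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))   # gamma = exp(log_gamma), learned per task
        self.levels = 2 ** bits - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.clamp(1e-6, 1.0)                          # assume raw data normalized to (0, 1]
        gamma = self.log_gamma.exp()
        y = x.pow(gamma)                                # learnable non-linear companding
        q = torch.round(y * self.levels) / self.levels  # low-bit uniform quantization
        return y + (q - y).detach()                     # straight-through gradient

quant = GammaQuant(bits=4)
raw = torch.rand(2, 3, 8, 8)                            # stand-in for normalized 12-bit raw data
out = quant(raw)
out.sum().backward()                                    # gradients flow to log_gamma
print(out.shape, quant.log_gamma.grad)
```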

Authors:Song Fei, Tian Ye, Lujia Wang, Lei Zhu
Title: LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Abstract:
Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics -- conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or MLLM captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on -- rather than adding parameters or relying on text prompts -- is the governing lever for robust and caption-free universal image restoration in the wild.

Authors:Haoyu Li, XiaoSong Li
Title: Gradient-based multi-focus image fusion with focus-aware saliency enhancement
Abstract:
Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications covering surveillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused details. To solve this problem, we propose an MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus information. In particular, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively preserve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the initial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary information across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. Our code is available at https://github.com/Lihyua/GICI
中文摘要:本文提出了一种基于显著边界增强的多焦点图像融合方法,通过梯度域模型和Tenengrad梯度检测优化边界细节,在四个公开数据集上的实验表明该方法在主观和客观评估中均优于12种先进方法。
English Summary: This paper introduces a multi-focus image fusion method that enhances boundary quality through gradient-domain modeling and Tenengrad detection, demonstrating superior performance over existing methods in preserving sharp focus transitions.
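
Since the method builds on the Tenengrad sharpness cue, a short sketch of that measure may help. This is a minimal, assumption-laden illustration (window size, thresholding, and the naive winner-take-all decision are illustrative choices, not the authors' settings).

```python
# Hedged sketch of a Tenengrad focus measure and a naive focus decision map.
import numpy as np
from scipy import ndimage

def tenengrad_map(gray: np.ndarray, window: int = 7) -> np.ndarray:
    """Local Tenengrad: windowed sum of squared Sobel gradient magnitudes."""
    gx = ndimage.sobel(gray.astype(np.float64), axis=1)
    gy = ndimage.sobel(gray.astype(np.float64), axis=0)
    energy = gx ** 2 + gy ** 2
    return ndimage.uniform_filter(energy, size=window)

def focus_decision(src_a: np.ndarray, src_b: np.ndarray) -> np.ndarray:
    """1 where source A is judged in focus, 0 where source B is."""
    return (tenengrad_map(src_a) >= tenengrad_map(src_b)).astype(np.uint8)

a, b = np.random.rand(64, 64), np.random.rand(64, 64)
mask = focus_decision(a, b)
fused = mask * a + (1 - mask) * b
print(mask.mean(), fused.shape)
```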

Authors:Xiao Wang, Shujuan Wu, Xiaoxia Cheng, Changwei Bi, Jin Tang, Bin Luo
Title: Pedestrian Attribute Recognition via Hierarchical Cross-Modality HyperGraph Learning
Abstract:
Current Pedestrian Attribute Recognition (PAR) algorithms typically focus on mapping visual features to semantic labels or attempt to enhance learning by fusing visual and attribute information. However, these methods fail to fully exploit attribute knowledge and contextual information for more accurate recognition. Although recent works have started to consider using attribute text as additional input to enhance the association between visual and semantic information, these methods are still in their infancy. To address the above challenges, this paper proposes the construction of a multi-modal knowledge graph, which is utilized to mine the relationships between local visual features and text, as well as the relationships between attributes and extensive visual context samples. Specifically, we propose an effective multi-modal knowledge graph construction method that fully considers the relationships among attributes and the relationships between attributes and vision tokens. To effectively model these relationships, this paper introduces a knowledge graph-guided cross-modal hypergraph learning framework to enhance the standard pedestrian attribute recognition framework. Comprehensive experiments on multiple PAR benchmark datasets have thoroughly demonstrated the effectiveness of our proposed knowledge graph for the PAR task, establishing a strong foundation for knowledge-guided pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
中文摘要:本文提出了一种多模态知识图谱,通过建模视觉特征与属性文本之间的关系来提升行人属性识别性能,并在多个基准数据集上通过全面实验验证了其有效性。
English Summary: This paper introduces a multi-modal knowledge graph to enhance pedestrian attribute recognition by modeling relationships between visual features and attribute texts, validated through comprehensive experiments on benchmark datasets.

Authors:Pierrick Chatillon, Julien Rabin, David Tschumperlé
Title: NIFTY: a Non-Local Image Flow Matching for Texture Synthesis
Abstract:
This paper addresses the problem of exemplar-based texture synthesis. We introduce NIFTY, a hybrid framework that combines recent insights on diffusion models trained with convolutional neural networks, and classical patch-based texture optimization techniques. NIFTY is a non-parametric flow-matching model built on non-local patch matching, which avoids the need for neural network training while alleviating common shortcomings of patch-based methods, such as poor initialization or visual artifacts. Experimental results demonstrate the effectiveness of the proposed approach compared to representative methods from the literature. Code is available at https://github.com/PierrickCh/Nifty.git
中文: 本文提出NIFTY混合框架,结合扩散模型与基于斑块的纹理优化技术,无需神经网络训练即可解决基于范例的纹理合成问题,并有效克服传统斑块方法的常见缺陷。
English: This paper introduces NIFTY, a hybrid framework for exemplar-based texture synthesis that combines diffusion models with patch-based optimization, eliminating neural network training while overcoming common limitations of patch methods.

Authors:Jinpeng Lu, Linghan Cai, Yinda Chen, Guo Tang, Songhan Jiang, Haoyuan Shi, Zhiwei Xiong
Title: Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
Abstract:
Lightweight 3D medical image segmentation remains constrained by a fundamental "efficiency / robustness conflict", particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representation of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a "glance-and-focus" principle, where PWA rapidly retrieves multi-scale information, and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model's ability to operate with a low computational budget. By extending the dual-stream architecture to incorporate modal interaction into the multi-scale image-retrieval process, VeloxSeg efficiently models heterogeneous modalities. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26% Dice improvement, alongside increasing GPU throughput by 11x and CPU throughput by 48x. Codes are available at https://github.com/JinPLu/VeloxSeg.
Chinese: VeloxSeg提出了一种双流CNN-Transformer架构,结合配对窗口注意力和约翰逊-林登斯特劳斯引理指导的卷积,通过模态交互和多尺度特征提取,在提升轻量化3D医学图像分割效率与鲁棒性的同时,实现了显著的性能改进和计算加速。
English: VeloxSeg introduces a dual-stream CNN-Transformer architecture with Paired Window Attention and Johnson-Lindenstrauss lemma-guided convolution, enhancing lightweight 3D medical image segmentation by improving efficiency and robustness across heterogeneous modalities while achieving significant performance gains and computational speedups.
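
The Johnson-Lindenstrauss intuition behind JLC, compressing channel dimensionality with a random projection that approximately preserves pairwise distances before a cheap convolution, can be sketched as follows. The module layout is an assumption for illustration only, not the paper's implementation.

```python
# Hedged sketch of a JL-style channel projection followed by a light 3D conv.
import torch
import torch.nn as nn

class JLProjectedConv(nn.Module):
    def __init__(self, in_ch: int, proj_ch: int, out_ch: int):
        super().__init__()
        # Fixed (non-trained) Gaussian JL projection, scaled by 1/sqrt(proj_ch).
        proj = torch.randn(proj_ch, in_ch, 1, 1, 1) / proj_ch ** 0.5
        self.register_buffer("proj", proj)
        self.conv = nn.Conv3d(proj_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = nn.functional.conv3d(x, self.proj)   # channel-wise random projection
        return self.conv(x)

block = JLProjectedConv(in_ch=256, proj_ch=32, out_ch=64)
vol = torch.randn(1, 256, 8, 16, 16)             # a small 3D feature volume
print(block(vol).shape)                          # -> torch.Size([1, 64, 8, 16, 16])
```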

Authors:Michael Jungo, Andreas Fischer
Title: Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models
Abstract:
Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning on the task of Document Image Classification, which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.
中文摘要:基于规则的强化学习在文档图像分类任务中展现出更强的泛化能力,尤其在处理分布外图像、未见类别和不同模态数据时表现优异。
English Summary: Rule-based reinforcement learning shows improved generalization for document image classification, particularly with out-of-distribution data across images, classes, and modalities.
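
A verifiable reward for classification is simple enough to sketch directly. The tag format and partial-credit values below are assumptions for illustration, not the repository's exact reward function.

```python
# Hedged sketch of a rule-based, verifiable reward for document image classification.
import re

def classification_reward(completion: str, gold_label: str) -> float:
    """1.0 for a correctly formatted, correct answer; small shaping otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0                                  # no parseable answer at all
    predicted = match.group(1).strip().lower()
    if predicted == gold_label.strip().lower():
        return 1.0                                  # verifiable exact match
    return 0.1                                      # well-formatted but wrong

print(classification_reward("<think>looks like a form</think><answer>invoice</answer>", "Invoice"))
print(classification_reward("it is an invoice", "Invoice"))
```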

Authors:Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
Abstract:
Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150$\times$ speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.
Chinese: FlashEdit是一种新颖的图像编辑框架,通过三项关键创新——OSIE绕过迭代过程、BG-Shield保护背景和SSCA实现精准局部编辑,实现了高保真度的实时编辑,速度提升150倍,编辑时间低于0.2秒。
English: FlashEdit is a novel framework that enables high-fidelity, real-time image editing by incorporating three key innovations—OSIE for bypassing iterative processes, BG-Shield for background preservation, and SSCA for precise localized edits—achieving a 150× speedup and edits in under 0.2 seconds.
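
The background-preservation idea behind BG-Shield, writing edited features only inside the edit-region mask while copying background features from the source pass, can be sketched in a few lines. Tensor names and where this blend sits in the network are assumptions for illustration.

```python
# Hedged sketch of masked feature blending in the spirit of BG-Shield.
import torch

def background_shield(src_feat: torch.Tensor, edit_feat: torch.Tensor,
                      edit_mask: torch.Tensor) -> torch.Tensor:
    """src_feat, edit_feat: (B, C, H, W); edit_mask: (B, 1, H, W) in {0, 1}."""
    return edit_mask * edit_feat + (1.0 - edit_mask) * src_feat

b, c, h, w = 1, 8, 16, 16
src, edit = torch.randn(b, c, h, w), torch.randn(b, c, h, w)
mask = torch.zeros(b, 1, h, w)
mask[..., 4:12, 4:12] = 1.0                                    # edit only this region
out = background_shield(src, edit, mask)
print(torch.allclose(out[..., 0, 0], src[..., 0, 0]))          # background untouched -> True
```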

Authors:Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He
Title: MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Abstract:
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
Chinese: MinerU2.5是一个12亿参数的视觉语言模型,采用从粗到精的两阶段解析策略,通过将全局布局分析与局部内容识别分离,在保持高计算效率的同时实现了最先进的文档解析性能。
English: MinerU2.5 is a 1.2B-parameter vision-language model that uses a two-stage parsing strategy to achieve state-of-the-art document recognition accuracy with high computational efficiency by decoupling layout analysis from content recognition.
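
The coarse-to-fine control flow described above (layout analysis on a downsampled page, then recognition on native-resolution crops) can be summarized in a short sketch. Both model calls below are placeholders, not the released model's API, and the downsampling factor is an illustrative choice.

```python
# Hedged sketch of a two-stage, coarse-to-fine document parsing loop.
from PIL import Image

def detect_layout(page_small: Image.Image):
    """Placeholder layout model: returns boxes in the small image's coordinates."""
    w, h = page_small.size
    return [{"type": "text", "box": (0, 0, w, h // 2)},
            {"type": "table", "box": (0, h // 2, w, h)}]

def recognize_region(crop: Image.Image, region_type: str) -> str:
    """Placeholder recognizer for a native-resolution crop."""
    return f"<{region_type} {crop.size[0]}x{crop.size[1]}>"

def parse_page(page: Image.Image, downsample: int = 4):
    small = page.resize((page.width // downsample, page.height // downsample))
    results = []
    for region in detect_layout(small):                 # stage 1: cheap global layout
        x0, y0, x1, y1 = (v * downsample for v in region["box"])
        crop = page.crop((x0, y0, x1, y1))              # stage 2: native-resolution crop
        results.append(recognize_region(crop, region["type"]))
    return results

print(parse_page(Image.new("RGB", (2048, 2896), "white")))
```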

Authors:Inzamamul Alam, Md Tanvir Islam, Simon S. Woo
Title: SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection
Abstract:
The increasing realism of content generated by GANs and diffusion models has made deepfake detection significantly more challenging. Existing approaches often focus solely on spatial or frequency-domain features, limiting their generalization to unseen manipulations. We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. The core Dual-Domain Feature Coupler (DDFC) decomposes features into a local spatial branch for capturing texture-level anomalies and a global spectral branch that employs Fast Fourier Transform to model periodic inconsistencies. This dual-domain formulation allows SpecXNet to jointly exploit localized detail and global structural coherence, which are critical for distinguishing authentic from manipulated images. We also introduce the Dual Fourier Attention (DFA) module, which dynamically fuses spatial and spectral features in a content-aware manner. Built atop a modified XceptionNet backbone, we embed the DDFC and DFA modules within a separable convolution block. Extensive experiments on multiple deepfake benchmarks show that SpecXNet achieves state-of-the-art accuracy, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility. Our results highlight the effectiveness of unified spatial-spectral learning for robust and generalizable deepfake detection. To ensure reproducibility, we released the full code on GitHub: https://github.com/inzamamulDU/SpecXNet.
中文摘要:SpecXNet提出了一种结合空间与频谱特征的双域架构,通过创新模块实现了最先进的深度伪造检测,具有强大的泛化能力和实时性能。
English Summary: SpecXNet introduces a dual-domain architecture combining spatial and spectral features through novel modules, achieving state-of-the-art deepfake detection with strong generalization and real-time performance.
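
A dual-domain block of the kind the abstract describes, a local spatial branch alongside a global spectral branch built on a 2D FFT, can be sketched as below. Channel sizes and the concatenation-based fusion are illustrative assumptions, not the paper's exact DDFC/DFA modules.

```python
# Hedged sketch of a spatial + spectral (FFT) dual-domain feature block.
import torch
import torch.nn as nn

class DualDomainBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.spectral = nn.Conv2d(2 * channels, channels, 1)   # acts on FFT magnitude + phase
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.spatial(x)                                # texture-level anomalies
        freq = torch.fft.rfft2(x, norm="ortho")                # global periodic structure
        spec = torch.cat([freq.abs(), freq.angle()], dim=1)
        global_feat = self.spectral(spec)
        global_feat = nn.functional.interpolate(global_feat, size=x.shape[-2:])
        return self.fuse(torch.cat([local, global_feat], dim=1))

block = DualDomainBlock(channels=16)
print(block(torch.randn(2, 16, 32, 32)).shape)                 # -> torch.Size([2, 16, 32, 32])
```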

Authors:Muxi Chen, Zhaohua Zhang, Chenchen Zhao, Mingyang Chen, Wenyu Jiang, Tianwen Jiang, Jianhuan Zhuo, Yu Tang, Qiuyong Xiao, Jihong Zhang, Qiang Xu
Title: FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration
Abstract:
Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at https://github.com/cure-lab/FailureAtlas
中文摘要:FailureAtlas提出了首个主动探索框架,能够自主绘制文本到图像模型的系统性故障图谱,在Stable Diffusion中发现超24.7万个错误案例并揭示其与训练数据匮乏的关联。
English Summary: FailureAtlas introduces an active exploration framework that autonomously maps systematic failures in text-to-image models, revealing over 247,000 error cases in Stable Diffusion and linking them to training data scarcity.

Authors:Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
Title: ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
Abstract:
Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
中文: ERGO采用两阶段推理流程,先识别下采样图像中的任务相关区域,再仅对这些区域进行全分辨率处理,从而以显著降低的计算成本实现更高的准确率。
English: ERGO introduces a two-stage reasoning pipeline that first identifies task-relevant regions in downsampled images and then processes only those areas at full resolution, achieving higher accuracy with significantly reduced computational costs.

Authors:Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, Peter Wonka
Title: Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation
Abstract:
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models, enabling visual correspondence in a manner analogous to the well-established semantic correspondence. While diffusion model backbones are known to encode semantically rich features, they must also contain visual features to support their image synthesis capabilities. However, isolating these visual features is challenging due to the absence of annotated datasets. To address this, we introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences based on existing subject-driven image generation datasets, and design a contrastive architecture to separate the two feature types. Leveraging the disentangled representations, we propose a new metric, Visual Semantic Matching (VSM), that quantifies visual inconsistencies in subject-driven image generation. Empirical results show that our approach outperforms global feature-based metrics such as CLIP, DINO, and vision-language models in quantifying visual inconsistencies while also enabling spatial localization of inconsistent regions. To our knowledge, this is the first method that supports both quantification and localization of inconsistencies in subject-driven generation, offering a valuable tool for advancing this task. Project page: https://abdo-eldesokey.github.io/mind-the-glitch/

Authors:Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen, Zhiyang Dou, Cheng Lin, Yuan Liu, Mingqiang Wei
Title: PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
Abstract:
Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding.
中文:PartSAM是一种创新的可提示三维部件分割模型,通过直接学习大规模三维数据,能够精确识别表面与内部结构,并在多个基准测试中大幅超越现有方法。
English: PartSAM is a novel promptable 3D part segmentation model that directly learns from large-scale 3D data, enabling accurate identification of both surface and internal structures while significantly outperforming existing methods.

Authors:Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
Title: MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning
Abstract:
Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading to severe attribute leakage that compromises subject fidelity and a failure to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is a significant entanglement of attention between different subjects during the generation process. Therefore, we introduce explicit positional supervision to separate the attention regions of each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention region of different subjects in diverse scenarios, we employ a Mixture-of-Experts architecture to enhance the model's capacity, allowing different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism to accurately assess multi-subject fidelity and a more stable training strategy tailored for the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning with human preferences better.

Authors:Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Title: Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
中文: 本研究提出情感陈述判断任务和自动化流程,以解决多模态大语言模型在视觉情感感知评估中的局限,发现其在情感解读方面表现较强,但与人类能力仍存在显著差距。
English: This study introduces an Emotion Statement Judgment task and automated pipeline to address limitations in evaluating Multimodal Large Language Models' (MLLMs) visual emotion perception, revealing their strengths in emotion interpretation but significant gaps compared to human performance.

Authors:Woosung Joung, Daewon Chae, Jinkyu Kim
Title: SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet
Abstract:
ControlNet has enabled detailed spatial control in text-to-image diffusion models by incorporating additional visual conditions such as depth or edge maps. However, its effectiveness heavily depends on the availability of visual conditions that are precisely aligned with the generation goal specified by the text prompt, a requirement that often fails in practice, especially for uncommon or imaginative scenes. For example, generating an image of a cat cooking in a specific pose may be infeasible due to the lack of suitable visual conditions. In contrast, structurally similar cues can often be found in more common settings-for instance, poses of humans cooking are widely available and can serve as rough visual guides. Unfortunately, existing ControlNet models struggle to use such loosely aligned visual conditions, often resulting in low text fidelity or visual artifacts. To address this limitation, we propose SemanticControl, a training-free method for effectively leveraging misaligned but semantically relevant visual conditions. Our approach adaptively suppresses the influence of the visual condition where it conflicts with the prompt, while strengthening guidance from the text. The key idea is to first run an auxiliary denoising process using a surrogate prompt aligned with the visual condition (e.g., "a human playing guitar" for a human pose condition) to extract informative attention masks, and then utilize these masks during the denoising of the actual target prompt (e.g., cat playing guitar). Experimental results demonstrate that our method improves performance under loosely aligned conditions across various conditions, including depth maps, edge maps, and human skeletons, outperforming existing baselines. Our code is available at https://mung3477.github.io/semantic-control.

Authors:Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh
Title: Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers
Abstract:
Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: https://jibin86.github.io/syncphony_project_page
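
The Audio Sync Guidance described above, pushing the full audio-conditioned prediction away from a visually aligned off-sync model, is analogous to classifier-free guidance and can be sketched in a few lines. The function name and the exact form of the guidance formula are assumptions, not the paper's implementation.

```python
# Hedged sketch of an audio-sync guidance step in the spirit of Syncphony.
import torch

def audio_sync_guidance(eps_full: torch.Tensor, eps_offsync: torch.Tensor,
                        scale: float = 2.0) -> torch.Tensor:
    """Extrapolate toward the audio-aware prediction to strengthen sync cues."""
    return eps_offsync + scale * (eps_full - eps_offsync)

# Toy step: both models predict noise for the same latent video chunk.
eps_full = torch.randn(1, 4, 8, 40, 64)      # audio-conditioned prediction
eps_off = torch.randn(1, 4, 8, 40, 64)       # visually aligned, no audio layers
print(audio_sync_guidance(eps_full, eps_off).shape)
```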

Authors:Minje Kim, Tae-Kyun Kim
Title: SRHand: Super-Resolving Hand Images and 3D Shapes via View/Pose-aware Neural Image Representations and Explicit 3D Meshes
Abstract:
Reconstructing detailed hand avatars plays a crucial role in various applications. While prior works have focused on capturing high-fidelity hand geometry, they heavily rely on high-resolution multi-view image inputs and struggle to generalize on low-resolution images. Multi-view image super-resolution methods have been proposed to enforce 3D view consistency. These methods, however, are limited to static objects/scenes with fixed resolutions and are not applicable to articulated deformable hands. In this paper, we propose SRHand (Super-Resolution Hand), the method for reconstructing detailed 3D geometry as well as textured images of hands from low-resolution images. SRHand leverages the advantages of implicit image representation with explicit hand meshes. Specifically, we introduce a geometric-aware implicit image function (GIIF) that learns detailed hand prior by upsampling the coarse input images. By jointly optimizing the implicit image function and explicit 3D hand shapes, our method preserves multi-view and pose consistency among upsampled hand images, and achieves fine-detailed 3D reconstruction (wrinkles, nails). In experiments using the InterHand2.6M and Goliath datasets, our method significantly outperforms state-of-the-art image upsampling methods adapted to hand datasets, and 3D hand reconstruction methods, quantitatively and qualitatively. Project page: https://yunminjin2.github.io/projects/srhand

Authors:Yu Shang, Yangcheng Yu, Xin Zhang, Xin Jin, Haisheng Su, Wei Wu, Yong Li
Title: MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation
Abstract:
Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel space model. This design allows MoWM to highlight the informative visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.
中文摘要:提出的MoWM框架融合了潜在模型的运动感知表征和像素模型的细粒度视觉特征,在CALVIN基准测试中实现了最优的任务成功率,显著提升了具身行动规划能力。
English Summary: The proposed MoWM framework combines motion-aware latent representations with fine-grained visual features from pixel models to enhance embodied action planning, achieving state-of-the-art performance on the CALVIN benchmark.
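
One way to couple a motion-aware latent prior to pixel-space features, as the abstract describes, is a FiLM-style modulation; the sketch below assumes that form purely for illustration, and the shapes and module names are not the paper's.

```python
# Hedged sketch of latent-to-pixel feature modulation (FiLM-style coupling).
import torch
import torch.nn as nn

class LatentToPixelModulation(nn.Module):
    def __init__(self, latent_dim: int, pixel_ch: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(latent_dim, 2 * pixel_ch)

    def forward(self, pixel_feat: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        """pixel_feat: (B, C, H, W); latent: (B, latent_dim)."""
        scale, shift = self.to_scale_shift(latent).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return pixel_feat * (1.0 + scale) + shift      # latent prior guides pixel details

mod = LatentToPixelModulation(latent_dim=128, pixel_ch=64)
out = mod(torch.randn(2, 64, 14, 14), torch.randn(2, 128))
print(out.shape)                                       # -> torch.Size([2, 64, 14, 14])
```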

Authors:Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao, Wei Wu, Yong Li
Title: LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE
Abstract:
Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua-fib-lab/Longscape.
中文: LongScape提出了一种结合扩散去噪与自回归生成的混合框架,通过动作引导的分块机制和情境感知专家混合模型,实现了稳定高质量的长序列具身操作视频生成。
English: LongScape introduces a hybrid framework combining diffusion denoising and autoregressive generation with action-guided chunking and a Context-aware Mixture-of-Experts to achieve stable, high-quality long-horizon video generation for embodied manipulation.

Authors:Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan
Title: Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Abstract:
Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure mode, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. This leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code will be available at: https://github.com/YU-deep/ViF.git.
中文: 多智能体视觉语言模型因视觉注意力减弱而产生幻觉雪球效应,而提出的ViF方法通过视觉流和注意力重分配有效缓解此问题,显著提升了多个基准测试的性能。
English: Multi-agent systems using visual language models are prone to hallucination snowballing due to reduced visual attention, which is effectively mitigated by the proposed ViF method that uses visual flow and attention reallocation to enhance performance across multiple benchmarks.

Authors:Lihao Zheng, Jiawei Chen, Xintian Shen, Hao Ma, Tao Wei
Title: MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning
Abstract:
Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we innovatively propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.
中文: 为解决大型视觉语言模型缺乏跨图像推理能力的问题,我们提出MIRG-RL统一框架,结合监督微调与图像感知强化学习,在多图像定位基准测试中实现了最先进的性能。
English: To address the lack of cross-image reasoning capabilities in Large Visual Language Models, we propose MIRG-RL, a unified framework that combines supervised fine-tuning with image-aware reinforcement learning, achieving state-of-the-art performance in multi-image grounding benchmarks.

Authors:Tianci Wu, Guangming Zhu, Jiang Lu, Siyuan Wang, Ning Wang, Nuoye Xiong, Zhang Liang
Title: Prompt-guided Representation Disentanglement for Action Recognition
Abstract:
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git
Chinese: ProDA框架通过时空场景图和动态提示模块,从复杂多动作场景中分离指定动作,利用图解析神经网络生成动作特定表征,在视频动作识别中展现出优于现有方法的性能。
English: The ProDA framework introduces a novel approach to action recognition by disentangling specified actions from complex multi-action scenes using spatio-temporal scene graphs and a dynamic prompt module, achieving state-of-the-art performance through action-specific representations.

Authors:Lan Chen, Yuchao Gu, Qi Mao
Title: UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Abstract:
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
中文摘要:UniVid框架通过视觉句子表示,将预训练视频生成模型应用于多种视觉任务,在跨模态和跨数据源的场景中展现出强大泛化能力,无需针对特定任务进行训练。
English Summary: The UniVid framework adapts a pre-trained video generation model to handle diverse vision tasks through visual sentence representations, demonstrating strong generalization across modalities and data sources without task-specific training.

Authors:Mehwish Mehmood, Ivor Spence, Muhammad Fahim
Title: LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation
Abstract:
Lightweight retinal vessel segmentation is important for the early diagnosis of vision-threatening and systemic diseases, especially in a real-world clinical environment with limited computational resources. Although segmentation methods based on deep learning are improving, existing models are still facing challenges of small vessel segmentation and high computational costs. To address these challenges, we proposed a new vascular segmentation network, LFA-Net, which incorporates a newly designed attention module, LiteFusion-Attention. This attention module incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention, enabling the model to capture local and global context efficiently and in a lightweight manner. LFA-Net offers high performance with 0.11 million parameters, 0.42 MB memory size, and 4.46 GFLOPs, which make it ideal for resource-constrained environments. We validated our proposed model on DRIVE, STARE, and CHASE_DB with outstanding performance in terms of dice scores of 83.28, 87.44, and 84.50% and Jaccard indices of 72.85, 79.31, and 74.70%, respectively. The code of LFA-Net is available online https://github.com/Mehwish4593/LFA-Net.
Chinese: 研究人员开发了LFA-Net轻量化视网膜血管分割网络,采用创新的LiteFusion-Attention模块,在低计算资源下实现高性能,非常适合临床诊断应用。
English: Researchers developed LFA-Net, a lightweight retinal vessel segmentation network featuring the innovative LiteFusion-Attention module, which achieves high performance with minimal computational resources, making it ideal for clinical diagnostics.

Authors:Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas, Diane Oyen, Earl Lawrence
Title: MORPH: Shape-agnostic PDE Foundation Models
Abstract:
We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D-3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.
Chinese: MORPH是一种与形状无关的自回归PDE基础模型,能处理1D-3D异构时空数据集,通过创新的架构组件和高效训练技术,在泛化任务中超越现有模型表现。
English: MORPH is a shape-agnostic, autoregressive foundation model for PDEs that handles heterogeneous spatiotemporal datasets across 1D-3D dimensions and outperforms existing models in generalization tasks through innovative architectural components and efficient training techniques.
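
The axial factorization mentioned above, replacing full spatiotemporal self-attention with attention applied along one axis at a time, can be sketched as below. Head counts and the use of nn.MultiheadAttention are illustrative choices, not the released architecture.

```python
# Hedged sketch of axial attention over a (B, T, H, W, C) spatiotemporal tensor.
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, axis: int) -> torch.Tensor:
        """x: (B, T, H, W, C); attend along `axis` in {1, 2, 3} only."""
        moved = x.movedim(axis, -2)                       # (..., L_axis, C)
        flat = moved.reshape(-1, moved.shape[-2], moved.shape[-1])
        out, _ = self.attn(flat, flat, flat)
        return out.reshape(moved.shape).movedim(-2, axis)

x = torch.randn(2, 6, 8, 8, 32)                           # (B, T, H, W, C)
block = AxialAttention(dim=32)
y = block(block(block(x, axis=1), axis=2), axis=3)        # time, then height, then width
print(y.shape)                                            # -> torch.Size([2, 6, 8, 8, 32])
```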

Authors:You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, Linjie Luo
Title: X-Streamer: Unified Human World Modeling with Audiovisual Interaction
Abstract:
We introduce X-Streamer, an end-to-end multimodal human world modeling framework for building digital human agents capable of infinite interactions across text, speech, and video within a single unified architecture. Starting from a single portrait, X-Streamer enables real-time, open-ended video calls driven by streaming multimodal inputs. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation, turning a static portrait into persistent and intelligent audiovisual interactions. The Thinker module perceives and reasons over streaming user inputs, while its hidden states are translated by the Actor into synchronized multimodal streams in real time. Concretely, the Thinker leverages a pretrained large language-speech model, while the Actor employs a chunk-wise autoregressive diffusion model that cross-attends to the Thinker's hidden states to produce time-aligned multimodal responses with interleaved discrete text and audio tokens and continuous video latents. To ensure long-horizon stability, we design inter- and intra-chunk attentions with time-aligned multimodal positional embeddings for fine-grained cross-modality alignment and context retention, further reinforced by chunk-wise diffusion forcing and global identity referencing. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences from arbitrary portraits and paving the way toward unified world modeling of interactive digital humans.

Authors:Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Zhiqiang Tao
Title: X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
Abstract:
Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of the embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates the model behavior and data quality analysis. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
Chinese: 现有文本-视频检索系统存在数据质量低和结果不可解释的问题,X-CoT通过采用基于大语言模型的推理框架替代传统嵌入模型,不仅提升了检索性能,还能生成详细推理过程。
English: Current text-to-video retrieval systems face limitations from low-quality data and lack of interpretability, which X-CoT addresses by replacing embedding models with an LLM-based reasoning framework that enhances performance and provides detailed explanations.
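
The pairwise-comparison ranking step can be sketched with a placeholder judge. In the snippet below, `llm_prefers_first` is a hypothetical stand-in for the LLM chain-of-thought comparison, and the word-overlap heuristic and aggregation by comparison sort are assumptions, not the paper's procedure.

```python
# Hedged sketch of pairwise-comparison ranking with a placeholder LLM judge.
import functools

def llm_prefers_first(query: str, cand_a: str, cand_b: str) -> bool:
    """Placeholder judge: replace with an LLM chain-of-thought comparison."""
    return len(set(query.split()) & set(cand_a.split())) >= \
           len(set(query.split()) & set(cand_b.split()))

def rank_candidates(query: str, candidates: list[str]) -> list[str]:
    def cmp(a: str, b: str) -> int:
        return -1 if llm_prefers_first(query, a, b) else 1
    return sorted(candidates, key=functools.cmp_to_key(cmp))

videos = ["a dog surfing on a wave", "a cat sleeping", "a dog running on the beach"]
print(rank_candidates("dog playing near the ocean", videos))
```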

Authors:Rohan Sanda, Asad Aali, Andrew Johnston, Eduardo Reis, Jonathan Singh, Gordon Wetzstein, Sara Fridovich-Keil
Title: Patch-Based Diffusion for Data-Efficient, Radiologist-Preferred MRI Reconstruction
Abstract:
Magnetic resonance imaging (MRI) requires long acquisition times, raising costs, reducing accessibility, and making scans more susceptible to motion artifacts. Diffusion probabilistic models that learn data-driven priors can potentially assist in reducing acquisition time. However, they typically require large training datasets that can be prohibitively expensive to collect. Patch-based diffusion models have shown promise in learning effective data-driven priors over small real-valued datasets, but have not yet demonstrated clinical value in MRI. We extend the Patch-based Diffusion Inverse Solver (PaDIS) to complex-valued, multi-coil MRI reconstruction, and compare it against a state-of-the-art whole-image diffusion baseline (FastMRI-EDM) for 7x undersampled MRI reconstruction on the FastMRI brain dataset. We show that PaDIS-MRI models trained on small datasets of as few as 25 k-space images outperform FastMRI-EDM on image quality metrics (PSNR, SSIM, NRMSE), pixel-level uncertainty, cross-contrast generalization, and robustness to severe k-space undersampling. In a blinded study with three radiologists, PaDIS-MRI reconstructions were chosen as diagnostically superior in 91.7% of cases, compared to baselines (i) FastMRI-EDM and (ii) classical convex reconstruction with wavelet sparsity. These findings highlight the potential of patch-based diffusion priors for high-fidelity MRI reconstruction in data-scarce clinical settings where diagnostic confidence matters.
中文摘要:基于补丁的扩散模型PaDIS-MRI仅需少量训练数据即可实现高质量加速MRI重建,在诊断准确性和鲁棒性上均优于传统方法。
English Summary: Patch-based diffusion models like PaDIS-MRI enable high-quality, accelerated MRI reconstruction with small training datasets, outperforming conventional methods in diagnostic accuracy and robustness.

Authors:Yuan Gao, Hao Wu, Qingsong Wen, Kun Wang, Xian Wu, Xiaomeng Huang
Title: VISION: Prompting Ocean Vertical Velocity Reconstruction from Incomplete Observations
Abstract:
Reconstructing subsurface ocean dynamics, such as vertical velocity fields, from incomplete surface observations poses a critical challenge in Earth science, a field long hampered by the lack of standardized, analysis-ready benchmarks. To systematically address this issue and catalyze research, we first build and release KD48, a high-resolution ocean dynamics benchmark derived from petascale simulations and curated with expert-driven denoising. Building on this benchmark, we introduce VISION, a novel reconstruction paradigm based on Dynamic Prompting designed to tackle the core problem of missing data in real-world observations. The essence of VISION lies in its ability to generate a visual prompt on-the-fly from any available subset of observations, which encodes both data availability and the ocean's physical state. More importantly, we design a State-conditioned Prompting module that efficiently injects this prompt into a universal backbone, endowed with geometry- and scale-aware operators, to guide its adaptive adjustment of computational strategies. This mechanism enables VISION to precisely handle the challenges posed by varying input combinations. Extensive experiments on the KD48 benchmark demonstrate that VISION not only substantially outperforms state-of-the-art models but also exhibits strong generalization under extreme data missing scenarios. By providing a high-quality benchmark and a robust model, our work establishes a solid infrastructure for ocean science research under data uncertainty. Our codes are available at: https://github.com/YuanGao-YG/VISION.
中文: 本研究提出了高分辨率海洋动力学基准KD48和新型重建模型VISION,该模型通过动态视觉提示机制自适应处理不完整观测数据,在极端数据缺失场景下显著优于现有方法,为海洋科学研究建立了坚实基础。
English: This study introduces KD48, a high-resolution ocean dynamics benchmark, and VISION, a novel reconstruction model that dynamically adapts to incomplete data through visual prompting, significantly outperforming existing methods and enhancing research infrastructure for subsurface ocean analysis.

Authors:Hude Liu, Jerry Yao-Chieh Hu, Jennifer Yuntong Zhang, Zhao Song, Han Liu
Title: Are Hallucinations Bad Estimations?
Abstract:
We formalize hallucinations in generative models as failures to link an estimate to any plausible cause. Under this interpretation, we show that even loss-minimizing optimal estimators still hallucinate. We confirm this with a general, high-probability lower bound on the hallucination rate for generic data distributions. This reframes hallucination as a structural misalignment between loss minimization and human-acceptable outputs, and hence as an estimation error induced by miscalibration. Experiments on coin aggregation, open-ended QA, and text-to-image generation support our theory.
Chinese: 该研究将生成模型中的幻觉重新定义为损失最小化与人类期望之间的结构性错配,证明即使是最优估计器也会因校准误差而产生幻觉。
English: The study redefines hallucinations in generative models as structural misalignment between loss minimization and human expectations, demonstrating that even optimal estimators hallucinate due to estimation errors from miscalibration.

Authors:Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi
Title: TUN3D: Towards Real-World Scene Understanding from Unposed Images
Abstract:
Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d .
中文摘要:TUN3D是首个仅通过多视角图像即可联合实现布局估计与3D物体检测的方法,无需深度传感器或相机位姿真值监督,在多项基准测试中达到了最先进的性能水平。
English Summary: TUN3D is the first method to jointly perform layout estimation and 3D object detection from multi-view images without requiring depth sensors or camera pose supervision, achieving state-of-the-art performance across multiple benchmarks.

Authors:Anja Sheppard, Tyler Smithline, Andrew Scheffer, David Smith, Advaith V. Sethuraman, Ryan Bird, Sabrina Lin, Katherine A. Skinner
Title: ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data
Abstract:
In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are important markers of maritime history and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at https://github.com/umfieldrobotics/ShipwreckFinderQGISPlugin.
Chinese: ShipwreckFinder 是一款开源 QGIS 插件,通过深度学习从多波束声纳数据中自动检测沉船,在分割精度上优于现有方法。
English: ShipwreckFinder is an open-source QGIS plugin that uses deep learning to automatically detect shipwrecks from multibeam sonar data, outperforming existing methods in segmentation accuracy.

Authors:Yinfeng Yu, Hailong Zhang, Meiling Zhu
Title: Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
Abstract:
Audiovisual embodied navigation enables robots to locate audio sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior works have explored basic fusion of visual and audio data, they often overlook deeper perceptual context. To address this, we propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
Chinese: 提出的DMTF-AVN模型通过多目标Transformer架构动态融合视觉与听觉线索,在多个基准数据集上实现了导航精度与适应性的最先进性能。
English: The proposed DMTF-AVN model advances audiovisual navigation by dynamically fusing visual and auditory cues through a multi-target Transformer architecture, achieving state-of-the-art performance in accuracy and adaptability across benchmark datasets.

Authors:Jiahao Zhang, Wenzhe Yin, Shujian Yu
Title: Cross-Modal Retrieval with Cauchy-Schwarz Divergence
Abstract:
Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder's inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
中文: 本文提出柯西-施瓦茨散度及其广义形式,通过无超参数且稳定的方法在统一框架中对齐多模态,解决了跨模态检索中的关键局限。
English: This paper introduces the Cauchy-Schwarz divergence and its generalized version to address limitations in cross-modal retrieval by providing a hyperparameter-free, stable method that effectively aligns multiple modalities within a unified framework.
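As a point of reference, the discrete Cauchy-Schwarz divergence the abstract builds on can be computed in a few lines; the paper's estimator for continuous embeddings and the circular GCS scheme for three or more modalities are not reproduced here, so treat this as an illustrative sketch.

```python
import numpy as np

def cs_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Cauchy-Schwarz divergence between two discrete distributions.
    D_CS(p, q) = -log( <p, q>^2 / (<p, p> <q, q>) ), which is >= 0 and equals
    0 iff p and q are proportional; no bandwidth or temperature to tune."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    cross = np.dot(p, q) ** 2
    return float(-np.log(cross / (np.dot(p, p) * np.dot(q, q) + eps) + eps))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])
print(cs_divergence(p, q), cs_divergence(p, p))  # small positive value, ~0
```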

Authors:Hmrishav Bandyopadhyay, Rahim Entezari, Jim Scott, Reshinth Adithyan, Yi-Zhe Song, Varun Jampani
Title: SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
Abstract:
We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.

Authors:Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, Stanley H. Chan
Title: NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
Abstract:
A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
中文摘要:现有文本生成视频模型存在物理一致性和可控性不足的问题,牛顿生成框架通过引入可训练的神经牛顿动力学,将物理规律融入生成过程,实现了具备精确参数控制的物理一致性视频合成。
English Summary: Current text-to-video models struggle with physical consistency and controllability, so NewtonGen introduces Neural Newtonian Dynamics to integrate learnable physics principles, enabling physically accurate video generation with precise parameter control.

Authors:Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Title: Quantized Visual Geometry Grounded Transformer
Abstract:
Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7$\times$ memory reduction and 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
中文: 本文提出QuantVGGT量化框架,通过双平滑细粒度量化和噪声过滤多样性采样技术,有效解决了十亿级视觉几何变换器量化中的重尾分布和校准稳定性问题,在保持98%以上重建精度的同时实现了3.7倍内存压缩和2.5倍推理加速。
English: This paper introduces QuantVGGT, a novel quantization framework that addresses the challenges of compressing billion-scale Visual Geometry Grounded Transformers through dual-smoothed fine-grained quantization and noise-filtered diverse sampling, achieving significant memory reduction and acceleration while maintaining high reconstruction accuracy.
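QuantVGGT's exact modules are not reproduced here, but the two generic ingredients the abstract names, an orthonormal (Hadamard) rotation to flatten heavy-tailed outlier channels and per-channel smoothing before low-bit uniform quantization, can be sketched as follows; the shapes, 4-bit setting, and scaling choices are assumptions for illustration.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

def quantize_symmetric(x: np.ndarray, bits: int = 4):
    """Per-tensor symmetric uniform quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax), scale

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((128, d))
x[:, 3] *= 40.0                      # one heavy-tailed outlier channel

# 1) Rotate activations with an orthonormal Hadamard matrix to spread outliers.
H = hadamard(d)
x_rot = x @ H

# 2) Smooth remaining inter-channel variance with per-channel scales
#    (folded into the next layer's weights in a real pipeline).
s = np.abs(x_rot).max(axis=0) + 1e-12
x_smooth = x_rot / s

q, scale = quantize_symmetric(x_smooth, bits=4)
x_rec = (q * scale) * s @ H.T        # undo smoothing and rotation
print("reconstruction MSE:", float(np.mean((x - x_rec) ** 2)))
```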

Authors:Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han
Title: VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Abstract:
Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study how to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users' queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user's requirements with the video content. More importantly, we propose two novel filtering policies that can be updated as user interaction continues. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct a user study to verify our agent's usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.

Authors:Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
Title: MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Abstract:
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
中文摘要:本研究针对多模态推理模型的局限性,提出方差感知采样方法以稳定强化学习优化,并发布了大规模精选数据集及可复现的训练资源。
English Summary: This research addresses limitations in multimodal reasoning models by introducing Variance-Aware Sampling to stabilize reinforcement learning optimization and releasing large-scale curated datasets with reproducible training resources.
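The abstract does not give the exact form of the Variance Promotion Score, so the following is only a plausible sketch of variance-aware prompt selection: reward variance across a prompt's rollouts plus a simple trajectory-diversity bonus. The weighting lam and the embedding-based diversity measure are assumptions, not the paper's formula.

```python
import numpy as np

def variance_promotion_score(rewards, trajectory_embs, lam: float = 0.5) -> float:
    """Hypothetical VPS: reward variance across rollouts of one prompt, plus a
    diversity bonus (1 - mean pairwise cosine similarity of rollout embeddings)."""
    rewards = np.asarray(rewards, float)
    E = np.asarray(trajectory_embs, float)
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    sim = E @ E.T
    n = len(rewards)
    off_diag = (sim.sum() - np.trace(sim)) / max(n * (n - 1), 1)
    return float(rewards.var() + lam * (1.0 - off_diag))

def select_prompts(prompt_rollouts: dict, k: int) -> list:
    """Keep the k prompts whose rollouts promote the most reward variance,
    so GRPO-style group advantages do not collapse to zero."""
    scored = {p: variance_promotion_score(r["rewards"], r["embs"])
              for p, r in prompt_rollouts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

rollouts = {
    "p_easy": {"rewards": [1, 1, 1, 1], "embs": np.eye(4)[:4, :3] + 1.0},
    "p_hard": {"rewards": [0, 1, 0, 1], "embs": np.random.default_rng(0).random((4, 3))},
}
print(select_prompts(rollouts, k=1))  # the prompt with non-degenerate rewards wins
```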

Authors:Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, Ender Konukoglu
Title: MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
Abstract:
High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.
中文:提出的MedVSR框架通过跨状态空间传播解决特征对齐问题,并结合内部状态空间重建增强组织结构,有效应对医学视频超分辨率中的独特挑战,在多种医疗场景中展现出卓越性能。
English: The proposed MedVSR framework addresses unique challenges in medical video super-resolution, such as alignment difficulties and artifacts, through Cross State-Space Propagation for feature alignment and Inner State-Space Reconstruction for enhancing tissue structures, demonstrating superior performance across diverse medical scenarios.

Authors:Killian Steunou, Théo Druilhe, Sigurd Saue
Title: Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
Abstract:
Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
中文: 本研究证明,采用稀疏主成分分析(SPCA)作为防御机制,通过稀疏投影降低输入敏感性,能有效提升神经网络对抗攻击的鲁棒性,在保持竞争力的准确率同时,在白盒与黑盒攻击下均优于标准PCA方法。
English: This study demonstrates that using sparse principal component analysis (SPCA) as a defense mechanism enhances neural network robustness against adversarial attacks by reducing input sensitivity through sparser projections, maintaining competitive accuracy while outperforming standard PCA under various attack scenarios.
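In the abstract's notation (projection W, head weights u), the certificates plausibly take the familiar margin-over-dual-norm form sketched below for a binary head on features z = Wx; the paper's exact constants and multiclass extension may differ.

```latex
% Binary linear head on projected features: f(x) = u^\top W x + b.
% Holder's inequality bounds the score change under an input perturbation \delta:
%   |f(x+\delta) - f(x)| = |(W^\top u)^\top \delta| \le \|W^\top u\|_* \, \|\delta\|,
% so the predicted label cannot flip while \|\delta\| is below the certified radius:
\[
  r_\infty(x) = \frac{|f(x)|}{\|W^\top u\|_1} \quad (\ell_\infty \text{ threat model}),
  \qquad
  r_2(x) = \frac{|f(x)|}{\|W^\top u\|_2} \quad (\ell_2 \text{ threat model}).
\]
% Sparser projections W (as encouraged by SPCA) shrink these dual norms and enlarge r(x).
```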

Authors:Yuze He, Yanning Zhou, Wang Zhao, Jingwen Ye, Yushi Bai, Kaiwen Xiao, Yong-Jin Liu, Zhongqian Sun, Wei Yang
Title: CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling
Abstract:
We present CHARM, a novel parametric representation and generative framework for anime hairstyle modeling. While traditional hair modeling methods focus on realistic hair using strand-based or volumetric representations, anime hairstyle exhibits highly stylized, piecewise-structured geometry that challenges existing techniques. Existing works often rely on dense mesh modeling or hand-crafted spline curves, making them inefficient for editing and unsuitable for scalable learning. CHARM introduces a compact, invertible control-point-based parameterization, where a sequence of control points represents each hair card, and each point is encoded with only five geometric parameters. This efficient and accurate representation supports both artist-friendly design and learning-based generation. Built upon this representation, CHARM introduces an autoregressive generative framework that effectively generates anime hairstyles from input images or point clouds. By interpreting anime hairstyles as a sequential "hair language", our autoregressive transformer captures both local geometry and global hairstyle topology, resulting in high-fidelity anime hairstyle creation. To facilitate both training and evaluation of anime hairstyle generation, we construct AnimeHair, a large-scale dataset of 37K high-quality anime hairstyles with separated hair cards and processed mesh data. Extensive experiments demonstrate state-of-the-art performance of CHARM in both reconstruction accuracy and generation quality, offering an expressive and scalable solution for anime hairstyle modeling. Project page: https://hyzcluster.github.io/charm/

Authors:Suaiba Amina Salahuddin, Teresa Dorszewski, Marit Almenning Martiniussen, Tone Hovda, Antonio Portaluri, Solveig Thrun, Michael Kampffmeyer, Elisabeth Wetzer, Kristoffer Wickstrøm, Robert Jenssen
Title: Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models
Abstract:
Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a "dissector," our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists' workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.
Chinese: Mammo-CLIP Dissect 提出了一种基于概念的可解释性框架,揭示了经过乳腺摄影数据训练的深度学习模型比通用模型更能捕捉临床相关概念,同时强调了在微调过程中专业化与泛化之间的权衡。
English: Mammo-CLIP Dissect introduces a concept-based explainability framework that reveals how deep learning models trained on mammography data capture clinically relevant concepts more effectively than general models, while highlighting the trade-off between specialization and generalization during fine-tuning.

Authors:Guojun Lei, Rong Zhang, Chi Wang, Tianhang Liu, Hong Li, Zhiyuan Ma, Weiwei Xu
Title: UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition
Abstract:
We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/

Authors:Songyue Cai, Zongqian Wu, Yujie Mo, Liang Peng, Ping Hu, Xiaoshuang Shi, Xiaofeng Zhu
Title: Background Prompt for Few-Shot Out-of-Distribution Detection
Abstract:
Existing foreground-background (FG-BG) decomposition methods for few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence on the local class similarity seen in previous methods. Furthermore, we propose patch self-calibrated tuning, which takes sample diversity into account to flexibly select the number of background patches for each sample, thus addressing the fixed background extraction strategies of previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance compared to SOTA methods in both the OOD detection and near-OOD detection settings. The source code will be released at https://github.com/YuzunoKawori/Mambo.
中文:提出的Mambo框架通过结合精炼的局部背景相似性与类别相似性,并采用补丁自校准调整实现自适应背景块选择,显著提升了少样本分布外检测的鲁棒性和准确性,优于现有方法。
English: The proposed Mambo framework enhances few-shot out-of-distribution detection by integrating refined local background similarity with class similarity and employing patch self-calibrated tuning for adaptive background patch selection, outperforming existing methods in robustness and accuracy.

Authors:Sarmistha Das, R E Zera Marveen Lyngkhoi, Sriparna Saha, Alka Maurya
Title: Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
Abstract:
The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
中文摘要:FASTER框架通过整合多模态特征提取、优化摘要生成和视觉文本对齐,解决了冗长金融咨询视频的摘要难题,其综合测试表现优于现有模型。
English Summary: The FASTER framework addresses the challenge of summarizing lengthy financial advisory videos by integrating multimodal feature extraction, optimized summarization, and visual-text alignment, demonstrating superior performance over existing models through comprehensive testing.
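For context, the standard DPO objective that the modified loss presumably builds on is reproduced below, where y_w is the summary preferred against the human-aligned reference, y_l the rejected one, and beta a temperature; the BOS-specific fact-checking terms described in the abstract are not shown.

```latex
% \sigma is the logistic function; \pi_{\mathrm{ref}} is the frozen reference summariser.
\[
  \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right].
\]
```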

Authors:Jianbo Zhao, Taiyu Ban, Xiangjie Li, Xingtai Gui, Hangning Zhou, Lei Liu, Hongwei Zhao, Bin Li
Title: Autoregressive End-to-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement
Abstract:
The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent's worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models. The video document is available at https://tisa-dpo-e2e.github.io/.
中文: 提出的时间不变空间对齐模块通过将环境特征投射到一致的自我中心框架中,解决了自动驾驶中的时空错位问题,同时结合运动学动作预测和直接偏好优化来提升轨迹可行性和行为精细度,实现了最先进的性能。
English: The proposed Time-Invariant Spatial Alignment module addresses spatio-temporal misalignment in autonomous driving by projecting environmental features into consistent ego-centric frames, while kinematic action prediction and Direct Preference Optimization enhance trajectory feasibility and behavioral refinement, achieving state-of-the-art performance.
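Since the action head predicts acceleration and yaw rate, physical feasibility comes from integrating those actions through a kinematic model. Below is a minimal sketch using a forward-Euler unicycle model; the time step, state layout, and integration scheme are assumptions, not the paper's exact formulation.

```python
import math

def rollout(x, y, heading, speed, actions, dt=0.1):
    """Integrate (acceleration, yaw_rate) actions into a 2D trajectory.
    Forward-Euler unicycle model: every output pose is reachable by
    construction, which is what makes a kinematic head physically feasible."""
    traj = [(x, y, heading, speed)]
    for accel, yaw_rate in actions:
        speed = max(0.0, speed + accel * dt)      # no reversing in this toy model
        heading += yaw_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        traj.append((x, y, heading, speed))
    return traj

# 8 predicted actions: gentle acceleration while turning left.
actions = [(0.5, 0.05)] * 8
for x, y, h, v in rollout(0.0, 0.0, 0.0, 5.0, actions):
    print(f"x={x:5.2f}  y={y:5.2f}  heading={h:4.2f}  v={v:4.2f}")
```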

Authors:Wenhao Tang, Heng Fang, Ge Wu, Xiang Li, Ming-Ming Cheng
Title: Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework
Abstract:
Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide, which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time on PANDA (UNI). Extensive experiments demonstrate that focusing on the data challenges of CPath holds significant potential in the era of foundation models. The code is available at https://github.com/FangHeng/PackMIL
中文: 该研究提出的基于打包的多示例学习框架通过将可变长度特征序列打包为固定长度实现高效批量训练,同时引入残差分支和注意力下采样器来增强监督并减少冗余,在仅用12%训练时间的情况下实现了最高8%的准确率提升。
English: The proposed pack-based MIL framework addresses computational pathology challenges by packing variable-length feature sequences into fixed-length ones for efficient batched training, while introducing a residual branch and attention-driven downsampler to enhance supervision and reduce redundancy, achieving up to 8% accuracy improvement with only 12% training time.
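The packing step itself is straightforward to sketch: greedily pack several sampled, variable-length slide feature sequences into fixed-length training sequences while tracking which slide each token came from, so bag-level labels survive the packing. The pack length, truncation policy, and first-fit strategy below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def pack_sequences(slide_feats, pack_len=4096, pad_value=0.0):
    """Greedily pack variable-length [N_i, D] feature arrays into fixed-length
    packs of shape [pack_len, D], returning per-token slide ids (-1 = padding)
    so slide-level (bag-level) labels can still be assigned after packing."""
    order = sorted(range(len(slide_feats)), key=lambda i: -len(slide_feats[i]))
    packs, ids, used = [], [], []
    for i in order:
        feats = slide_feats[i][:pack_len]          # sample/truncate overlong slides
        for p in range(len(packs)):
            if used[p] + len(feats) <= pack_len:   # first fit
                packs[p].append(feats); ids[p] += [i] * len(feats); used[p] += len(feats)
                break
        else:
            packs.append([feats]); ids.append([i] * len(feats)); used.append(len(feats))
    D = slide_feats[0].shape[1]
    out_feats, out_ids = [], []
    for p, idlist in zip(packs, ids):
        x = np.concatenate(p, axis=0)
        pad = pack_len - len(x)
        out_feats.append(np.concatenate([x, np.full((pad, D), pad_value)], axis=0))
        out_ids.append(np.array(idlist + [-1] * pad))
    return np.stack(out_feats), np.stack(out_ids)

rng = np.random.default_rng(0)
slides = [rng.standard_normal((n, 16)) for n in (200, 3000, 1200, 2500)]
feats, ids = pack_sequences(slides, pack_len=4096)
print(feats.shape, ids.shape)  # (2, 4096, 16) (2, 4096)
```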

Authors:Zhifei Li, Feng Qiu, Yiran Wang, Yujing Xia, Kui Xiao, Miao Zhang, Yan Zhang
Title: Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Abstract:
Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.
Chinese Summary: IOG-VQA模型通过结合物体交互自注意力机制和基于GAN的去偏方法,有效提升了视觉问答性能,在标准数据集上展现出卓越的泛化能力和抗偏置特性。
English Summary: The IOG-VQA model enhances Visual Question Answering by integrating object interaction self-attention for better visual context understanding and GAN-based debiasing to mitigate dataset biases, achieving superior performance on benchmark datasets.

Authors:Yan Zhang, Jiaqing Lin, Miao Zhang, Kui Xiao, Xiaoju Hou, Yue Zhao, Zhifei Li
Title: SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
Abstract:
Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model's reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at https://github.com/HubuKG/SCRA-VQA.
中文:SCRA-VQA通过总结和重排图像描述来减少噪声,使大型语言模型能更好地理解图像并提升推理能力,无需昂贵训练即可在知识型视觉问答任务中取得优异表现。
English: SCRA-VQA enhances knowledge-based visual question answering by summarizing and reranking image captions to reduce noise, enabling large language models to better interpret images and improve reasoning without costly training, achieving high accuracy on challenging datasets.

Authors:Xiaonan Hu, Xuebing Li, Jinyu Xu, Abdulkadir Duran Adan, Letian Zhou, Xuhui Zhu, Yanan Li, Wei Guo, Shouyang Liu, Wenzhong Liu, Hao Lu
Title: TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting
Abstract:
Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars, so current CAC and open-world detection models are suboptimal for counting plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges as a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.
中文: TasselNetV4通过将植物计数从物种特定模型转向跨物种方法,结合视觉变换器和多分支局部计数器,在多种农业场景中实现了卓越的计数精度与效率。
English: TasselNetV4 advances plant counting by transitioning from species-specific models to a cross-species approach, leveraging a vision transformer and multi-branch local counters to achieve superior accuracy and efficiency across diverse agricultural scenarios.

Authors:Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
Title: Real-Time Object Detection Meets DINOv3
Abstract:
Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2
中文: DEIMv2通过融合DINOv3特征和空间调谐适配器,在八个模型尺寸上实现了卓越性能,以更少参数创下实时检测新纪录。
English: DEIMv2 enhances the DEIM framework by integrating DINOv3 features and a Spatial Tuning Adapter, achieving superior performance across eight model sizes with state-of-the-art results in real-time detection while reducing parameter counts.

Authors:Hyomin Choi, Heeji Han, Chris Rosewarne, Fabien Racapé
Title: CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks
Abstract:
With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of vision networks while retaining task accuracy in the context of two different inference scenarios: "remote" and "split" inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Picture Experts Group (MPEG) for the development of the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.
中文: 随着基于神经网络的计算机视觉应用日益增多,CompressAI-Vision作为一个开源评估平台被推出,用于测试在远程和分离推理场景下保持任务准确性的视频压缩方法,现已被MPEG采纳用于开发FCM标准。
English: With the rise of neural network-based computer vision applications, CompressAI-Vision is introduced as an open-source platform to evaluate video compression methods that maintain task accuracy in remote and split inference scenarios, now adopted by MPEG for the FCM standard.

Authors:Yuxuan Zhou, Xingxing Li, Shengyu Li, Zhuohao Yan, Chunxi Xia, Shaoquan Feng
Title: MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM
Abstract:
Visual SLAM is a cornerstone technique in robotics, autonomous driving and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion, a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visual alignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (https://github.com/GREAT-WHU/MASt3R-Fusion).
中文摘要:MASt3R-Fusion创新地将神经网络点云回归与惯性/GNSS数据通过分层因子图紧密融合,构建了能够实时进行姿态跟踪和全局一致建图的多传感器视觉SLAM系统,显著提升了精度与鲁棒性。
English Summary: MASt3R-Fusion is a novel multi-sensor visual SLAM framework that integrates neural pointmap regression with inertial and GNSS data through a hierarchical factor graph, achieving enhanced accuracy and robustness in real-time mapping and pose tracking.

Authors:Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, Yuguang Fang
Title: Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
Abstract:
Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.
中文: Neptune-X框架通过X-to-Maritime生成模型合成多样化海事场景,并采用任务感知样本选择机制,有效提升了海事目标检测的准确性,尤其在具有挑战性的场景中表现突出。
English: Neptune-X is a data-centric generative-selection framework that enhances maritime object detection by synthesizing diverse scenes with its X-to-Maritime model and dynamically selecting task-relevant samples, significantly improving accuracy in challenging conditions.

Authors:Ruixu Zhang, Yuran Wang, Xinyi Hu, Chaoyu Mai, Wenxuan Liu, Danni Xu, Xian Zhong, Zheng Wang
Title: Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset
Abstract:
Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.

Authors:Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang
Title: SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
Abstract:
Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.
中文: SceneWeaver提出了一种反思性智能体框架,通过迭代优化整合多样化的场景合成工具,在物理合理性、视觉真实性和语义对齐方面超越现有方法,并能泛化至复杂场景。
English: SceneWeaver introduces a reflective agentic framework that unifies diverse scene synthesis tools through iterative refinement, outperforming prior methods in physical plausibility, visual realism, and semantic alignment while generalizing to complex scenes.
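The closed-loop reason-act-reflect design can be sketched as a small tool-selection loop; everything named below (the tool registry keyed by evaluation criterion, the scoring stub, the stopping threshold) is a placeholder for illustration rather than SceneWeaver's actual planner, tools, or evaluator.

```python
def synthesize_scene(instruction, tools, evaluate, max_iters=5, target=0.9):
    """Closed-loop reason-act-reflect: pick a tool, apply it, self-evaluate on
    physical plausibility / visual realism / semantic alignment, and stop once
    every criterion clears the target score."""
    scene, history = None, []
    for step in range(max_iters):
        scores = evaluate(scene, instruction)              # dict over the 3 criteria
        history.append(scores)
        if min(scores.values()) >= target:
            break
        weakest = min(scores, key=scores.get)              # reflect: what is worst?
        tool = tools[weakest]                              # plan: pick a matching tool
        scene = tool(scene, instruction)                   # act: update the scene
    return scene, history

# Toy stand-ins so the loop runs end to end.
state = {"physical": 0.4, "visual": 0.5, "semantic": 0.3}
def make_tool(key):
    def tool(scene, instruction):
        state[key] = min(1.0, state[key] + 0.35)
        return {"desc": f"scene improved on {key}"}
    return tool

tools = {k: make_tool(k) for k in state}
evaluate = lambda scene, instruction: dict(state)
scene, history = synthesize_scene("a cozy reading nook with a desk", tools, evaluate)
print(scene, history[-1])
```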

Authors:Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu
Title: PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
Abstract:
Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl
中文:PhysCtrl是一种基于物理学的创新框架,通过基于合成动画训练的扩散模型学习物理动力学,生成逼真且可控的视频,在视觉质量和物理合理性方面均优于现有方法。
English: PhysCtrl is a novel physics-grounded framework that generates realistic and controllable videos by learning physical dynamics through a diffusion model trained on synthetic animations, outperforming existing methods in both visual quality and physical plausibility.

Authors:Bishal Adhikari, Jiajia Li, Eric S. Michel, Jacob Dykes, Te-Ming Paul Tseng, Mary Love Tagert, Dong Chen
Title: A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices
Abstract:
The escalating economic losses in agriculture due to deer intrusion, estimated in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies, which are often labor-intensive, costly, and ineffective for modern farming systems. Overcoming this requires intelligent, autonomous solutions built on accurate and efficient deer detection. Progress in this field, however, is impeded by a significant gap in the literature: the lack of a domain-specific, practical dataset and the limited study of the on-field deployability of deer detection systems. Addressing this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. The contributions of this work are threefold. First, we introduce a curated, publicly available dataset of 3,095 images with bounding-box annotations of deer, derived from the Idaho Cameratraps project. Second, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures (v8, v9, v10, and v11). Finally, we benchmark performance on a high-end NVIDIA RTX 5090 GPU and evaluate two representative edge computing platforms: the Raspberry Pi 5 and the NVIDIA Jetson AGX Xavier. Results show that real-time detection is not feasible on the Raspberry Pi without hardware-specific model optimization, while the NVIDIA Jetson provides greater than 30 FPS with GPU-accelerated inference on 's' and 'n' series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (AP@.5 > 0.85) and computational efficiency (FPS > 30). To support further research, both the source code and datasets are publicly available at https://github.com/WinnerBishal/track-the-deer.
Summary: This study addresses the inadequacy of traditional deer-intrusion mitigation in agriculture by introducing a curated dataset and evaluating recent YOLO models, finding that small, architecturally advanced variants such as YOLOv11n combine high accuracy with real-time performance on edge devices like the NVIDIA Jetson.
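For readers reproducing the throughput numbers, a minimal FPS benchmark with the ultralytics API might look like the following; the weight file and test image are placeholders, and the authors' exact benchmarking script may differ.

```python
# A minimal throughput benchmark in the spirit of the paper's edge evaluation.
# "yolo11n.pt" and "deer.jpg" are placeholder names, not the released assets.
import time
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                             # a smaller 'n'-series variant
model.predict("deer.jpg", imgsz=640, verbose=False)    # warm-up run

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    model.predict("deer.jpg", imgsz=640, verbose=False)
elapsed = time.perf_counter() - start
print(f"{n_runs / elapsed:.1f} FPS")   # >30 FPS is the real-time bar used here
```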

Authors:Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Title: FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Abstract:
Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.
Summary: The FAST framework introduces two novel modules, AIAS and FARM, to efficiently generate high-quality, segmentation-oriented industrial anomalies by accelerating sampling and preserving localized anomaly signals, significantly outperforming existing methods in downstream segmentation tasks.

Authors:Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng, Weimin Zhong
Title: HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy
Abstract:
Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address these challenges, we propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the feature degradation and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and producing a more comprehensive feature representation. To further enhance multi-scale feature representation and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets show that the proposed method outperforms existing segmentation techniques, with higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.
Summary: HiPerformer introduces a modular hierarchical encoder and specialized fusion modules to dynamically integrate multi-scale features, overcoming the limitations of conventional hybrid architectures and achieving superior medical image segmentation performance across multiple datasets.
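One plausible way to fuse a local CNN branch with a global transformer branch beyond concatenation or addition is cross-attention from local features to global tokens. The sketch below is an illustrative assumption, not the paper's actual LGFF design.

```python
# A hedged sketch of an LGFF-style fusion: local CNN features query global
# transformer tokens via cross-attention. Shapes and layers are assumptions.
import torch
import torch.nn as nn

class LGFFSketch(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feat, global_feat):
        # local_feat: (B, C, H, W) from a CNN branch
        # global_feat: (B, L, C) token sequence from a transformer branch
        B, C, H, W = local_feat.shape
        q = local_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
        fused, _ = self.cross(q, global_feat, global_feat)
        fused = self.norm(q + fused)                     # residual keeps local detail
        return self.proj(fused.transpose(1, 2).reshape(B, C, H, W))

out = LGFFSketch()(torch.randn(1, 64, 16, 16), torch.randn(1, 100, 64))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```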

Authors:Hao Lu, Zhuang Ma, Guangfeng Jiang, Wenhang Ge, Bohan Li, Yuzhan Cai, Wenzhao Zheng, Yunpeng Zhang, Yingcong Chen
Title: 4D Driving Scene Generation With Stereo Forcing
Abstract:
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experimental results demonstrate that our method achieves state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis, while simultaneously delivering competitive performance in downstream evaluations. Homepage: https://jiangxb98.github.io/PhiGensis

Authors:Kwang-Hyun Uhm, Hyunjun Cho, Sung-Hoo Hong, Seung-Won Jung
Title: An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation
Abstract:
Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.
Summary: This paper introduces a cross-view texture transfer method that exploits the anisotropic nature of 3D CT volumes, transferring high-resolution in-plane texture details to low-resolution through-plane images and significantly outperforming existing methods in CT slice interpolation.
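The core idea, letting low-resolution through-plane features query textures from several high-resolution in-plane reference slices, can be sketched as follows; the shapes and the pooling of references are illustrative assumptions, not the paper's exact module.

```python
# A hedged sketch of multi-reference non-local attention: a through-plane
# feature map attends to features pooled from R in-plane reference slices.
import torch
import torch.nn as nn

class MultiRefNonLocal(nn.Module):
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feat, ref_feats):
        # query_feat: (B, C, H, W) through-plane features
        # ref_feats:  (B, R, C, H, W) features from R in-plane references
        B, C, H, W = query_feat.shape
        q = query_feat.flatten(2).transpose(1, 2)         # (B, HW, C)
        kv = ref_feats.flatten(3).permute(0, 1, 3, 2)     # (B, R, HW, C)
        kv = kv.reshape(B, -1, C)                         # pool all references
        out, _ = self.attn(q, kv, kv)
        return (q + out).transpose(1, 2).reshape(B, C, H, W)

q = torch.randn(1, 32, 24, 24)
refs = torch.randn(1, 3, 32, 24, 24)   # three in-plane reference slices
print(MultiRefNonLocal()(q, refs).shape)  # torch.Size([1, 32, 24, 24])
```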

Authors:Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
Title: ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Abstract:
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
Summary: The study challenges the notion that CNNs are inherently texture-biased, showing through a domain-agnostic suppression framework that they primarily rely on local shape features, with reliance patterns varying systematically across computer vision, medical imaging, and remote sensing.
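As an illustration of cue suppression, the transforms below (grayscale for color, Gaussian blur for texture, patch shuffling for shape) are common choices in this literature and are offered as assumptions, not the paper's exact operators.

```python
# Illustrative cue-suppression transforms; the paper's framework defines its
# own controlled suppression conditions, so treat these as generic examples.
import torch
import torchvision.transforms.functional as TF

def suppress_color(img):
    return TF.rgb_to_grayscale(img, num_output_channels=3)

def suppress_texture(img, kernel_size=21):
    return TF.gaussian_blur(img, kernel_size=kernel_size)

def suppress_shape(img, grid=4):
    # shuffle image patches to destroy shape cues while keeping local texture
    C, H, W = img.shape
    ph, pw = H // grid, W // grid
    patches = [img[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(grid) for j in range(grid)]
    order = torch.randperm(len(patches))
    rows = [torch.cat([patches[order[i*grid + j]] for j in range(grid)], dim=2)
            for i in range(grid)]
    return torch.cat(rows, dim=1)

img = torch.rand(3, 224, 224)
for fn in (suppress_color, suppress_texture, suppress_shape):
    print(fn(img).shape)   # each keeps (3, 224, 224)
```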

Authors:Mahmoud Khater, Mona Strauss, Philipp von Olshausen, Alexander Reiterer
Title: PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation
Abstract:
Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at https://github.com/mvg-inatech/PU-Gaussian.git.
Summary: PU-Gaussian is an upsampling network that models local geometry with anisotropic 3D Gaussian distributions, enabling explicit point sampling followed by refinement to produce dense, high-fidelity point clouds, and achieves state-of-the-art results on the PU1K and PUGAN benchmarks.
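The explicit sampling step can be sketched directly: estimate an anisotropic Gaussian from each point's neighborhood and draw new points from it. PU-Gaussian predicts these Gaussians with a network; the covariance below is estimated from k nearest neighbors purely for illustration, and a refinement stage would follow.

```python
# A hedged sketch of Gaussian-based point upsampling; the network-predicted
# Gaussians of the paper are replaced here by k-NN covariance estimates.
import numpy as np

def upsample_with_gaussians(points, k=8, ratio=4, rng=np.random.default_rng(0)):
    # points: (N, 3) sparse input cloud
    N = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, 1:k+1]           # k nearest neighbors
    samples = []
    for i in range(N):
        nbrs = points[knn[i]]
        mu = nbrs.mean(axis=0)
        cov = np.cov(nbrs.T) + 1e-6 * np.eye(3)      # anisotropic local Gaussian
        samples.append(rng.multivariate_normal(mu, cov, size=ratio))
    return np.concatenate([points, np.concatenate(samples)], axis=0)

sparse = np.random.rand(256, 3)
dense = upsample_with_gaussians(sparse)
print(dense.shape)   # (256 + 256*4, 3): coarse cloud before refinement
```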

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Title: U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT
Abstract:
Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.789 and a DSC of 0.917 on the hidden test set, achieving first place in Task 1 of the STSR 2025 challenge. The code is available at https://github.com/zhiqin1998/UMamba2.
Summary: U-Mamba2-SSL is a semi-supervised learning framework that improves CBCT tooth and pulp segmentation through multi-stage training, combining self-supervised pre-training, consistency regularization, and pseudo-labeling to take first place in Task 1 of the STSR 2025 challenge.
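The consistency-regularization step can be illustrated as follows: predictions on a perturbed unlabeled volume are pulled toward predictions on the clean volume. The noise model and loss choice are assumptions; the paper uses its own input and feature perturbations.

```python
# A minimal sketch of consistency regularization on unlabeled data.
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, noise_std=0.1):
    with torch.no_grad():
        target = model(x_unlabeled).softmax(dim=1)       # clean prediction
    noisy = x_unlabeled + noise_std * torch.randn_like(x_unlabeled)
    pred = model(noisy).log_softmax(dim=1)               # perturbed prediction
    return F.kl_div(pred, target, reduction="batchmean")

# toy check with a 1x1-conv "segmenter" on a small 3D patch
model = torch.nn.Conv3d(1, 2, kernel_size=1)
x = torch.randn(2, 1, 8, 8, 8)
print(consistency_loss(model, x))
```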

Authors:Min Cen, Zhenfeng Zhuang, Yuzhe Zhang, Min Zeng, Baptiste Magnier, Lequan Yu, Hong Zhang, Liansheng Wang
Title: C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
Abstract:
Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H&E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C$^2$MIL. C$^2$MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C$^2$MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at https://github.com/mimic0127/C2MIL.
Summary: C$^2$MIL addresses staining-induced semantic bias and irrelevant topological noise in graph-based MIL for survival analysis by combining semantic causal intervention with topological causal discovery under a dual structural causal model, improving both generalization and interpretability.
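A differentiable Bernoulli edge-sampling step of the kind described can be sketched with PyTorch's RelaxedBernoulli plus a straight-through estimator; the parameterization and temperature are illustrative, and C$^2$MIL's exact formulation may differ.

```python
# A hedged sketch of differentiable Bernoulli subgraph sampling: each edge
# gets a learned keep-probability; a relaxed sample acts as a soft mask.
import torch
from torch.distributions import RelaxedBernoulli

def sample_subgraph(edge_logits, temperature=0.5):
    # edge_logits: (E,) learned scores for E candidate edges
    dist = RelaxedBernoulli(temperature=torch.tensor(temperature),
                            logits=edge_logits)
    soft_mask = dist.rsample()            # in (0, 1), gradients flow through
    hard_mask = (soft_mask > 0.5).float()
    # straight-through: hard decisions forward, soft gradients backward
    return hard_mask + (soft_mask - soft_mask.detach())

edge_logits = torch.randn(10, requires_grad=True)
mask = sample_subgraph(edge_logits)
mask.sum().backward()                     # gradients reach the edge scores
print(mask, edge_logits.grad is not None)
```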

Authors:Zizheng Yang, Hu Yu, Bing Li, Jinghao Zhang, Jie Huang, Feng Zhao
Title: Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
Abstract:
Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by retraining diffusion models, coupled with the extensive sampling steps during inference, limits the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a carefully designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and the iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.
Summary: DiffLI$^2$D exploits the semantic latent space of frozen pre-trained diffusion models to guide a dehazing network, avoiding retraining and iterative sampling while achieving superior performance across multiple dehazing datasets.

Authors:Manahil Raza, Ayesha Azam, Talha Qaiser, Nasir Rajpoot
Title: PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
Abstract:
Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.
Summary: PS3 is a transformer-based model that integrates pathology reports, whole slide images, and transcriptomic data through prototype representations to improve cancer survival prediction, outperforming clinical, unimodal, and multimodal baselines on six TCGA datasets.

Authors:Nico Schulthess, Ender Konukoglu
Title: Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture
Abstract:
In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory bank of normative features can be used directly for anomaly detection, as demonstrated recently. However, this is unsuitable for large medical datasets, as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as an anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that, through the DPMM, DINOv2 embeddings achieve very competitive anomaly detection performance on medical imaging benchmarks despite being trained on natural images, while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
Summary: This study models normative DINOv2 embeddings with a Dirichlet Process Mixture model for unsupervised anomaly detection in medical imaging, achieving competitive performance while at least halving inference-time computation compared to memory-bank approaches.
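The pipeline maps naturally onto scikit-learn's truncated Dirichlet-process mixture. The sketch below fits a DPMM to stand-in "normative" embeddings and scores test vectors by distance to the nearest active component center; the paper scores via similarity to component centers, so treat the details as assumptions.

```python
# A hedged sketch: DPMM over normative embeddings, score by distance to the
# nearest active component. Random vectors stand in for DINOv2 features.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
normal_embed = rng.normal(size=(2000, 64))      # stand-in for DINOv2 features

dpmm = BayesianGaussianMixture(
    n_components=50,                            # truncation level
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
).fit(normal_embed)

active = dpmm.weights_ > 1e-3                   # components the DP kept
centers = dpmm.means_[active]

def anomaly_score(embed):
    # distance to the closest normative component center
    d = np.linalg.norm(embed[:, None, :] - centers[None, :, :], axis=-1)
    return d.min(axis=1)

test = rng.normal(loc=3.0, size=(5, 64))        # shifted, "anomalous" vectors
print(anomaly_score(test) > anomaly_score(normal_embed[:5]))
```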

Authors:Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura
Title: MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
Abstract:
Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles, substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.
Summary: MeshMosaic is a local-to-global framework that scales artist-mesh generation to over 100K triangles by autoregressively generating patches with shared boundary conditions, significantly outperforming existing methods in geometric fidelity and user preference.

Authors:Phyo Thet Yee, Dimitrios Kollias, Sudeepta Mishra, Abhinav Dhall
Title: SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding
Abstract:
Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image) for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single reference image, restricting the model's ability to represent dynamic changes in actions or attributes across time. To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and audio-derived valence-arousal features), enabling the generation of talking face videos with richer and more authentic emotional expressiveness and fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, enabling it to capture dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism. A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall naturalness, motion diversity, and video smoothness. Our project page is available at .

Authors:Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha, Manish Gupta
Title: When Words Can't Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset
Abstract:
While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as "worst product" paired with a 5-second video of a headphone with a broken right earcup). This paper formulates a new task in complaint mining that addresses common users' need to write an expressive complaint: Complaint Description from Videos (CoD-V) (e.g., helping the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that distinguishes the proposed CoD-V task from standard video summarization and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG)-embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user's emotional state. We conduct a comprehensive evaluation of several Video Language Models (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction that provides a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.
Summary: This paper formulates Complaint Description from Videos (CoD-V), a task that leverages video content to help users articulate product complaints, supported by the ComVID dataset, a complaint-retention metric, and a multimodal RAG-embedded VideoLLaMA2-7b model.

Authors:Edmund Bu, Yossi Gandelsman
Title: Interpreting ResNet-based CLIP via Neuron-Attention Decomposition
Abstract:
We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP's attention-pooling layer. We find that these neuron-head pairs can be approximated by a single direction in CLIP-ResNet's image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs have a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.

Authors:Hyunjin Cho, Giyun Choi, Jongwon Choi
Title: AJAHR: Amputated Joint Aware 3D Human Mesh Recovery
Abstract:
Existing human mesh recovery methods assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss. This assumption introduces bias when applied to individuals with amputations, a limitation further exacerbated by the scarcity of suitable datasets. To address this gap, we propose Amputated Joint Aware 3D Human Mesh Recovery (AJAHR), an adaptive pose estimation framework that improves mesh reconstruction for individuals with limb loss. Our model integrates a body-part amputation classifier, jointly trained with the mesh recovery network, to detect potential amputations. We also introduce Amputee 3D (A3D), a synthetic dataset offering a wide range of amputee poses for robust training. While maintaining competitive performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals. Additional materials can be found at the project webpage.

Authors:Guo Chen, Jiarun Liu, Sicong Du, Chenming Wu, Deqi Li, Shi-Sheng Huang, Guofeng Zhang, Sheng Yang
Title: GS-RoadPatching: Inpainting Gaussians via 3D Searching and Placing for Driving Scenes
Abstract:
This paper presents GS-RoadPatching, an inpainting method for driving scene completion that refers to completely reconstructed regions represented by 3D Gaussian Splatting (3DGS). Unlike existing 3DGS inpainting methods that perform generative completion with 2D perspective-view-based diffusion or GAN models to predict limited appearance or depth cues for missing regions, our approach enables substitutional scene inpainting and editing directly in the 3DGS modality, freeing it from the spatial-temporal consistency requirements of 2D cross-modal models and eliminating the need for time-intensive retraining of Gaussians. Our key insight is that the highly repetitive patterns in driving scenes often share multi-modal similarities within the implicit 3DGS feature space and are particularly suitable for structural matching, enabling effective 3DGS-based substitutional inpainting. Practically, we construct feature-embedded 3DGS scenes to incorporate a patch measurement method for abstracting local context at different scales and, subsequently, propose a structural search method to find candidate patches effectively in 3D space. Finally, we propose a simple yet effective substitution-and-fusion optimization for better visual harmony. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness and efficiency of our proposed method in driving scenes, and the results validate that it achieves state-of-the-art performance compared to baseline methods in terms of both quality and interoperability. Additional experiments in general scenes also demonstrate the applicability of the proposed 3D inpainting strategy. The project page and code are available at: https://shanzhaguoo.github.io/GS-RoadPatching/

Authors:Miren Samaniego, Igor Rodriguez, Elena Lazkano
Title: CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation
Abstract:
We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36°) and MPIIFaceGaze (2.65°) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions on Gaze360 (9.06°) and human-robot interaction scenarios in RT-GENE (4.76°), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results can be found at: https://github.com/toukapy/capsStare
Summary: CapStARE is a capsule-based spatio-temporal architecture that achieves state-of-the-art gaze estimation with real-time inference and strong generalization across ETH-XGaze, MPIIFaceGaze, Gaze360, and RT-GENE.

Authors:Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
Title: Logics-Parsing Technical Report
Abstract:
Recent advances in Large Vision-Language models (LVLM) have spurred significant progress in the document parsing task. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, and mathematical formula recognition. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM's capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model's versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and state-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing
Summary: Logics-Parsing is an end-to-end LVLM-based document parser augmented with reinforcement learning rewards for layout analysis and reading-order inference, achieving state-of-the-art results on the newly introduced LogicsParsingBench of 1,078 page-level PDF images.

Authors:Jinhui Zheng, Xueyuan Gong
Title: ExpFace: Exponential Angular Margin Loss for Deep Face Recognition
Abstract:
Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: https://github.com/dfr-code/ExpFace.
Summary: ExpFace introduces an exponential angular margin that penalizes clean samples in the central angular region more than noisy samples in the periphery, avoiding the training instability of SphereFace and the non-monotonicity of ArcFace while achieving state-of-the-art face recognition performance.
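The abstract does not spell out the margin function, so the sketch below assumes one plausible exponential form, m * exp(-theta / s), which is largest for small angles (the center region) and decays toward the periphery, matching the described penalty behavior; the exact formulation should be taken from the paper.

```python
# A hedged sketch of an exponential angular margin in the style of ExpFace.
# The margin form m * exp(-theta / s_margin) is an assumption, not the
# paper's definition; it penalizes central (clean) samples more strongly.
import torch
import torch.nn.functional as F

def exp_margin_logits(embeddings, weight, labels, m=0.5, s_margin=1.0, scale=64.0):
    # embeddings: (B, D), weight: (C, D) class centers, labels: (B,)
    cos = F.normalize(embeddings) @ F.normalize(weight).t()   # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    margin = m * torch.exp(-theta / s_margin)   # large near center, small at periphery
    target = torch.zeros_like(cos).scatter_(1, labels[:, None], 1.0)
    logits = cos - target * margin              # margin applied to true class only
    return scale * logits

emb, w = torch.randn(4, 128), torch.randn(10, 128)
labels = torch.tensor([0, 3, 7, 9])
loss = F.cross_entropy(exp_margin_logits(emb, w, labels), labels)
print(loss)
```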

Authors:Yi Yang
Title: nnFilterMatch: A Unified Semi-Supervised Learning Framework with Uncertainty-Aware Pseudo-Label Filtering for Efficient Medical Segmentation
Abstract:
Semi-supervised learning (SSL) has emerged as a promising paradigm in medical image segmentation, offering competitive performance while substantially reducing the need for extensive manual annotation. When combined with active learning (AL), these strategies further minimize annotation burden by selectively incorporating the most informative samples. However, conventional SSL-AL hybrid approaches often rely on iterative, loop-based retraining cycles after each annotation round, incurring significant computational overhead and limiting scalability in clinical applications. In this study, we present a novel, annotation-efficient, and self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch), an AL-inspired mechanism, within the single-pass nnU-Net training framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, our method circumvents the need for retraining loops while preserving the benefits of uncertainty-guided learning. We validate the proposed framework across multiple clinical segmentation benchmarks and demonstrate that it achieves performance comparable to or exceeding fully supervised models, even with only 5%-20% labeled data. This work introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy. Code is available here: https://github.com/Ordi117/nnFilterMatch.git.
Summary: nnFilterMatch integrates entropy-based pseudo-label filtering, an active-learning-inspired mechanism, into single-pass nnU-Net training, matching or exceeding fully supervised segmentation with only 5%-20% labeled data and no retraining loops.
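The entropy-based filtering step can be sketched as a per-pixel mask over pseudo-labels; the threshold value and normalization below are illustrative assumptions, with the filter direction following the paper's rule of excluding the most confident pseudo-labels.

```python
# A minimal sketch of entropy-based pseudo-label filtering.
import torch

def entropy_filter_mask(logits, tau=0.15):
    # logits: (B, C, H, W) raw predictions on unlabeled images
    p = logits.softmax(dim=1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1)    # (B, H, W)
    entropy = entropy / torch.log(torch.tensor(float(logits.shape[1])))
    # mask out the most confident (lowest-entropy) pixels, per FilterMatch
    return (entropy > tau).float()

logits = torch.randn(2, 4, 32, 32)
mask = entropy_filter_mask(logits)
print(mask.mean())   # fraction of pixels whose pseudo-labels are kept
```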

Authors:Jiesi Hu, Yanwu Yang, Zhiyu Ye, Chenfei Ye, Hanyang Peng, Jianfeng Cao, Ting Ma
Title: Towards Robust In-Context Learning for Medical Image Segmentation via Data Synthesis
Abstract:
The rise of In-Context Learning (ICL) for universal medical image segmentation has introduced an unprecedented demand for large-scale, diverse datasets for training, exacerbating the long-standing problem of data scarcity. While data synthesis offers a promising solution, existing methods often fail to simultaneously achieve both high data diversity and a domain distribution suitable for medical data. To bridge this gap, we propose SynthICL, a novel data synthesis framework built upon domain randomization. SynthICL ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variations to create data cohorts suitable for ICL. Extensive experiments on four held-out datasets validate our framework's effectiveness, showing that models trained with our data achieve performance gains of up to 63% in average Dice and substantially enhanced generalization to unseen anatomical domains. Our work helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models. Our code and the generated dataset are publicly available at https://github.com/jiesihu/Neuroverse3D.
Summary: SynthICL mitigates data scarcity for in-context medical image segmentation by synthesizing anatomically realistic, diverse data cohorts via domain randomization, yielding up to 63% average Dice gains and stronger generalization on four held-out datasets.

Authors:Ling Lo, Kelvin C. K. Chan, Wen-Huang Cheng, Ming-Hsuan Yang
Title: From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
Abstract:
Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. Prompt interpolation, the most common approach for motion transitions, often fails on gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions by introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CAT-Bench are released at: https://github.com/lynn-ling-lo/Prompt2Progression.
Summary: This paper introduces frame-wise guidance during denoising that steers video diffusion models toward smooth, consistent attribute transitions while preserving motion dynamics, and proposes CAT-Bench with two new metrics to evaluate transition accuracy and smoothness.
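The frame-wise schedule can be illustrated as a per-frame blend between initial and final attribute directions applied to the noisy latents; the paper's data-specific construction of the transitional direction is more involved than this sketch.

```python
# A hedged sketch of frame-wise transitional guidance on video latents.
# The linear schedule and additive update are illustrative assumptions.
import torch

def guide_latents(latents, e_init, e_final, weight=0.1):
    # latents: (T, C, H, W) noisy video latents; e_init/e_final: (C, H, W)
    T = latents.shape[0]
    alphas = torch.linspace(0.0, 1.0, T)           # frame-wise progression
    for t in range(T):
        direction = (1 - alphas[t]) * e_init + alphas[t] * e_final
        latents[t] = latents[t] + weight * direction
    return latents

latents = torch.randn(16, 4, 32, 32)               # 16 frames
e_init, e_final = torch.randn(4, 32, 32), torch.randn(4, 32, 32)
print(guide_latents(latents, e_init, e_final).shape)
```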

Authors:Kunlun Xu, Yibo Feng, Jiangmeng Li, Yongsheng Qi, Jiahuan Zhou
Title: C${}^2$Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning
Abstract:
Federated continual learning (FCL) tackles scenarios of learning from continuously emerging task data across distributed clients, where the key challenge lies in addressing temporal forgetting over time and spatial forgetting simultaneously. Recently, prompt-based FCL methods have shown advanced performance through task-wise prompt communication. In this study, we underscore that existing prompt-based FCL methods suffer from insufficient class-wise knowledge coherence between prompts across clients. Class-wise knowledge coherence involves two aspects: (1) the intra-class distribution gap across clients, which degrades the learned semantics across prompts, and (2) inter-prompt class-wise relevance, which highlights cross-class knowledge confusion. During prompt communication, insufficient class-wise coherence exacerbates knowledge conflicts among new prompts and induces interference with old prompts, intensifying both spatial and temporal forgetting. To address these issues, we propose a novel Class-aware Client Knowledge Interaction (C${}^2$Prompt) method that explicitly enhances class-wise knowledge coherence during prompt communication. Specifically, a local class distribution compensation mechanism (LCDC) is introduced to reduce intra-class distribution disparities across clients, thereby reinforcing intra-class knowledge consistency. Additionally, a class-aware prompt aggregation scheme (CPA) is designed to alleviate inter-class knowledge confusion by selectively strengthening class-relevant knowledge aggregation. Extensive experiments on multiple FCL benchmarks demonstrate that C${}^2$Prompt achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/NeurIPS2025-C2Prompt
Summary: C${}^2$Prompt tackles class-wise knowledge coherence in prompt-based federated continual learning, using local class distribution compensation and class-aware prompt aggregation to reduce intra-class disparities and inter-class confusion, thereby mitigating both spatial and temporal forgetting.
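The class-aware aggregation idea, weighting client prompts by their relevance to the class being aggregated rather than averaging uniformly, can be sketched as follows; the similarity measure and temperature are assumptions, not the paper's CPA scheme.

```python
# A hedged sketch of class-aware prompt aggregation across clients.
import torch

def class_aware_aggregate(client_prompts, client_class_stats, target_class_stat):
    # client_prompts: (K, L, D) prompts from K clients
    # client_class_stats: (K, D) per-client class feature summaries
    # target_class_stat: (D,) summary for the class being aggregated
    sims = torch.cosine_similarity(client_class_stats,
                                   target_class_stat[None, :], dim=-1)  # (K,)
    weights = torch.softmax(sims / 0.1, dim=0)     # emphasize relevant clients
    return (weights[:, None, None] * client_prompts).sum(dim=0)   # (L, D)

prompts = torch.randn(5, 8, 64)      # 5 clients, 8 prompt tokens each
stats = torch.randn(5, 64)
target = torch.randn(64)
print(class_aware_aggregate(prompts, stats, target).shape)  # torch.Size([8, 64])
```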

Authors:Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita
Title: ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation
Abstract:
Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.

Authors:Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y. Chen, Bohan Zhuang
Title: VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
Abstract:
Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment's reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities. The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.
Summary: VolSplat replaces pixel-aligned Gaussian prediction with a voxel-aligned paradigm for feed-forward 3D Gaussian Splatting, achieving state-of-the-art novel view synthesis with improved geometric consistency and more robust, view-consistent reconstructions.

Authors:Bingnan Li, Chen-Yu Wang, Haiyang Xu, Xiang Zhang, Ethan Armand, Divyansh Srivastava, Xiaojun Shan, Zeyuan Chen, Jianwen Xie, Zhuowen Tu
Title: OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps
Abstract:
Despite steady progress in layout-to-image generation, current methods still struggle with layouts containing significant overlap between bounding boxes. We identify two primary challenges: (1) large overlapping regions and (2) overlapping instances with minimal semantic distinction. Through both qualitative examples and quantitative analysis, we demonstrate how these factors degrade generation quality. To systematically assess this issue, we introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. Our analysis reveals that existing benchmarks are biased toward simpler cases with low OverLayScore values, limiting their effectiveness in evaluating model performance under more challenging conditions. To bridge this gap, we present OverLayBench, a new benchmark featuring high-quality annotations and a balanced distribution across different levels of OverLayScore. As an initial step toward improving performance on complex overlaps, we also propose CreatiLayout-AM, a model fine-tuned on a curated amodal mask dataset. Together, our contributions lay the groundwork for more robust layout-to-image generation under realistic and challenging scenarios. Project link: https://mlpc-ucsd.github.io/OverLayBench.
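The abstract does not give OverLayScore's formula; as a hedged illustration, one natural proxy for the complexity of overlapping layouts is the sum of pairwise IoUs over all bounding-box pairs.

```python
# An illustrative overlap-complexity proxy; this is an assumption, not the
# paper's definition of OverLayScore.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def overlap_complexity(boxes):
    return sum(iou(boxes[i], boxes[j])
               for i in range(len(boxes)) for j in range(i + 1, len(boxes)))

layout = [(0, 0, 100, 100), (50, 50, 150, 150), (200, 200, 260, 260)]
print(overlap_complexity(layout))  # only the first pair overlaps
```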

Authors:Gabriel Maldonado, Narges Rashvand, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi
Title: Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps
Abstract:
Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method's superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion's complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.
Summary: This work presents an adversarially-refined VQ-GAN with dense motion tokenization that compresses spatio-temporal heatmaps while preserving fine-grained motion traces, outperforming a dVAE baseline by 9.31% SSIM, reducing temporal instability by 37.1%, and revealing optimal codebook sizes of 128 tokens for 2D and 1024 for 3D motion.
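The dense tokenization step rests on standard vector quantization: each latent vector is replaced by its nearest codebook entry, with a straight-through estimator so gradients reach the encoder. The sketch below mirrors the reported vocabulary sizes but is otherwise generic, not the paper's exact module.

```python
# A minimal sketch of nearest-codebook vector quantization for motion tokens.
import torch

def quantize(z, codebook):
    # z: (N, D) encoder outputs; codebook: (K, D) learned code vectors
    d = torch.cdist(z, codebook)              # (N, K) pairwise distances
    idx = d.argmin(dim=1)                     # dense motion tokens
    z_q = codebook[idx]
    return z + (z_q - z).detach(), idx        # straight-through quantization

codebook_2d = torch.randn(128, 64)            # compact vocabulary for 2D motion
z = torch.randn(32, 64, requires_grad=True)
z_q, tokens = quantize(z, codebook_2d)
z_q.sum().backward()                          # gradients pass to the encoder
print(tokens.shape, z.grad is not None)
```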

Authors:Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos
Title: Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
Abstract:
Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP
English Summary: This work introduces a vision-free, single-encoder retrieval method that replaces traditional multimodal models with text-to-text retrieval using structured image descriptions, achieving superior performance while reducing computational costs and privacy concerns.
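Since the paradigm reduces retrieval to cosine search over text embeddings, a short sketch makes it concrete; the query vector and description matrix are assumed to come from any single sentence encoder, which is our assumption rather than the paper's specific model.

    import numpy as np

    def retrieve(query_vec, description_vecs, top_k=5):
        # Cosine similarity between one query embedding and the embeddings
        # of generated image descriptions (both L2-normalized here).
        q = query_vec / np.linalg.norm(query_vec)
        D = description_vecs / np.linalg.norm(description_vecs, axis=1, keepdims=True)
        scores = D @ q
        order = np.argsort(-scores)[:top_k]
        return order, scores[order]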

Authors:Yun Wang, Junjie Hu, Junhui Hou, Chenghao Zhang, Renwei Yang, Dapeng Oliver Wu
Title: RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions
Abstract:
Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Codes are available at https://github.com/cocowy1/RoSe-Robust-Self-supervised-Stereo-Matching-under-Adverse-Weather-Conditions.
English: Recent self-supervised stereo matching methods struggle in adverse weather due to degraded feature extraction and disrupted pixel correspondences, prompting the development of a robust training paradigm that integrates visual foundation model priors and scene correspondence learning to significantly enhance performance.

Authors:Görkay Aydemir, Weidi Xie, Fatma Güney
Title: Track-On2: Enhancing Online Point Tracking with Memory
Abstract:
In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e. tracking points frame-by-frame, making it suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. Through comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2
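The abstract attributes Track-On2's temporal coherence to a memory mechanism read causally at each frame; the sketch below shows one generic way such a per-point memory could work (a fixed-length buffer with attention readout), purely as an assumption-laden illustration and not the paper's architecture.

    import torch
    import torch.nn.functional as F

    class PointMemory:
        """Fixed-length feature buffer per tracked point with attention readout."""
        def __init__(self, k=12):
            self.k, self.buf = k, []

        def update(self, feat):            # feat: (N, D) per-point features
            self.buf.append(feat)
            self.buf = self.buf[-self.k:]  # keep only the most recent k frames

        def read(self, query):             # query: (N, D) current-frame features
            mem = torch.stack(self.buf, dim=1)                 # (N, K, D)
            logits = (mem @ query.unsqueeze(-1)).squeeze(-1)   # (N, K)
            attn = F.softmax(logits / mem.shape[-1] ** 0.5, dim=1)
            return (attn.unsqueeze(-1) * mem).sum(dim=1)       # (N, D) readout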

Authors:Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe
Title: 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference
Abstract:
Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not reach its full potential on referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i
English: Sa2VA-i is an enhanced version of Sa2VA that addresses inconsistencies between training and inference, achieving state-of-the-art results on multiple video segmentation benchmarks with significant performance improvements.

Authors:Honghao Chen, Xingzhou Lou, Xiaokun Feng, Kaiqi Huang, Xinlong Wang
Title: Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards
Abstract:
Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.
English: This paper introduces chain of step reasoning for vision-language models, enabling fine-grained reward evaluation and reinforcement learning to achieve consistent improvements on benchmarks while providing transparent framework components and empirical insights.

Authors:Lorenzo Shaikewitz, Tim Nguyen, Luca Carlone
Title: Category-Level Object Shape and Pose Estimation in Less Than a Millisecond
Abstract:
Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object's unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario. Code is released at https://github.com/MIT-SPARK/Fast-ShapeAndPose.
English: This paper introduces a fast local solver for object shape and pose estimation that uses category-level priors and provides an efficient global optimality certificate, achieving rapid performance with self-consistent field iteration in about 100 microseconds per iteration.
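The solver's core loop is compact enough to sketch: repeatedly assemble the 4-by-4 matrix at the current quaternion iterate and take its minimum eigenpair. The toy H below is only a stand-in for the paper's problem-specific matrix assembly.

    import numpy as np

    def scf_iterate(H_of_q, q0, tol=1e-10, max_iters=100):
        q = q0 / np.linalg.norm(q0)
        for _ in range(max_iters):
            w, V = np.linalg.eigh(H_of_q(q))   # symmetric 4x4 eigen-decomposition
            q_new = V[:, 0]                    # eigenvector of the minimum eigenvalue
            if q @ q_new < 0:                  # resolve the eigenvector sign ambiguity
                q_new = -q_new
            if np.linalg.norm(q_new - q) < tol:
                break
            q = q_new
        return q_new, w[0]

    A = np.diag([3.0, 2.0, 1.0, 0.5])
    H = lambda q: A + 0.1 * np.outer(q, q)     # toy eigenvector nonlinearity
    q_star, lam = scf_iterate(H, np.ones(4))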

Authors:Pamela Osuna-Vargas, Altug Kamacioglu, Dominik F. Aschauer, Petros E. Vlachos, Sercan Alipek, Jochen Triesch, Simon Rumpel, Matthias Kaschube
Title: SynapFlow: A Modular Framework Towards Large-Scale Analysis of Dendritic Spines
Abstract:
Dendritic spines are key structural components of excitatory synapses in the brain. Given that the size of dendritic spines provides a proxy for synaptic efficacy, their detection and tracking across time are important for studies of the neural basis of learning and memory. Despite their relevance, large-scale analyses of the structural dynamics of dendritic spines in 3D+time microscopy data remain challenging and labor-intensive. Here, we present a modular machine learning-based pipeline designed to automate the detection, time-tracking, and feature extraction of dendritic spines in volumes chronically recorded with two-photon microscopy. Our approach tackles the challenges posed by biological data by combining a transformer-based detection module, a depth-tracking component that integrates spatial features, a time-tracking module to associate 3D spines across time by leveraging spatial consistency, and a feature extraction unit that quantifies biologically relevant spine properties. We validate our method on open-source labeled spine data, and on two complementary annotated datasets that we publish alongside this work: one for detection and depth-tracking, and one for time-tracking, which, to the best of our knowledge, is the first data of this kind. To encourage future research, we release our data, code, and pre-trained weights at https://github.com/pamelaosuna/SynapFlow, establishing a baseline for scalable, end-to-end analysis of dendritic spine dynamics.
English: This study introduces a machine learning pipeline that automates the detection, tracking, and feature analysis of dendritic spines in 3D time-lapse microscopy data to facilitate research on learning and memory.

Authors:Kartik Teotia, Helge Rhodin, Mohit Mendiratta, Hyeongwoo Kim, Marc Habermann, Christian Theobalt
Title: Audio-Driven Universal Gaussian Head Avatars
Abstract:
We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes both geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject's global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.
English: This paper presents the first universal photorealistic avatar synthesis method that maps raw audio inputs into a latent expression space encoding both geometric and appearance variations, enabling highly realistic avatars with precise lip synchronization and nuanced expressive details.

Authors:Chuni Liu, Hongjie Li, Jiaqi Du, Yangyang Hou, Qian Sun, Lei Jin, Ke Xu
Title: Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset
Abstract:
The pretraining-finetuning paradigm is a crucial strategy in metallic surface defect detection for mitigating the challenges posed by data scarcity. However, its implementation presents a critical dilemma. Pretraining on natural image datasets, such as ImageNet, faces a significant domain gap. Meanwhile, naive self-supervised pretraining on in-domain industrial data is often ineffective due to the inability of existing learning objectives to distinguish subtle defect patterns from complex background noise and textures. To resolve this, we introduce Anomaly-Guided Self-Supervised Pretraining (AGSSP), a novel paradigm that explicitly guides representation learning through anomaly priors. AGSSP employs a two-stage framework: (1) it first pretrains the model's backbone by distilling knowledge from anomaly maps, encouraging the network to capture defect-salient features; (2) it then pretrains the detector using pseudo-defect boxes derived from these maps, aligning it with localization tasks. To enable this, we develop a knowledge-enhanced method to generate high-quality anomaly maps and collect a large-scale industrial dataset of 120,000 images. Additionally, we present two small-scale, pixel-level labeled metallic surface defect datasets for validation. Extensive experiments demonstrate that AGSSP consistently enhances performance across various settings, achieving up to a 10% improvement in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to ImageNet-based models. All code, pretrained models, and datasets are publicly available at https://clovermini.github.io/AGSSP-Dev/.
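One step the abstract names, turning anomaly maps into pseudo-defect boxes for detector pretraining, admits a simple sketch via connected components; the threshold value and the use of scipy here are our assumptions, not the paper's procedure.

    import numpy as np
    from scipy import ndimage

    def anomaly_to_boxes(amap, thr=0.5):
        # Threshold the anomaly map and convert each connected component
        # into an (x1, y1, x2, y2) pseudo-defect box.
        labels, _ = ndimage.label(amap > thr)
        boxes = []
        for sl in ndimage.find_objects(labels):
            if sl is not None:
                boxes.append((sl[1].start, sl[0].start, sl[1].stop, sl[0].stop))
        return boxes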

Authors:Suzannah Wistreich, Baiyu Shi, Stephen Tian, Samuel Clarke, Michael Nath, Chengyi Xu, Zhenan Bao, Jiajun Wu
Title: DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation
Abstract:
Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin's capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin's suitability and practicality for learning real-world, contact-rich manipulation. Please see our project webpage for videos and visualizations: https://dex-skin.github.io/.

Authors:Kuang Xiaodong, Li Bingxuan, Li Yuan, Rao Fan, Ma Gege, Xie Qingguo, Mok Greta S P, Liu Huafeng, Zhu Wentao
Title: A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising
Abstract:
Achieving high image quality for temporal frames in dynamic positron emission tomography (PET) is challenging due to limited statistics, especially for short frames. Recent studies have shown that deep learning (DL) is useful in a wide range of medical image denoising tasks. In this paper, we propose a model-based neural network for dynamic PET image denoising. The inter-frame spatial correlation and intra-frame structural consistency in dynamic PET are used to establish the kernel space-based multidimensional sparse (KMDS) model. We then substitute the inherent forms of the parameter estimation with neural networks to enable adaptive parameter optimization, forming the end-to-end neural KMDS-Net. Extensive experimental results from simulated and real data demonstrate that the neural KMDS-Net exhibits strong denoising performance for dynamic PET, outperforming previous baseline methods. The proposed method may be used to effectively achieve high temporal and spatial resolution for dynamic PET. Our source code is available at https://github.com/Kuangxd/Neural-KMDS-Net/tree/main.
English: This paper introduces the neural KMDS-Net, a model-based deep learning approach that leverages inter-frame spatial correlations and intra-frame structural consistency to effectively denoise dynamic PET images, demonstrating superior performance over existing methods on both simulated and real data.

Authors:Ruichao Hou, Xingyuan Li, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao
Title: HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection
Abstract:
RGB-thermal salient object detection (RGB-T SOD) aims to identify prominent objects by integrating complementary information from RGB and thermal modalities. However, learning the precise boundaries and complete objects remains challenging due to the intrinsic insufficient feature fusion and the extrinsic limitations of data scarcity. In this paper, we propose a novel hybrid prompt-driven segment anything model (HyPSAM), which leverages the zero-shot generalization capabilities of the segment anything model (SAM) for RGB-T SOD. Specifically, we first propose a dynamic fusion network (DFNet) that generates high-quality initial saliency maps as visual prompts. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction, overcoming the limitations of fixed-parameter kernels and enhancing multi-modal feature representation. Moreover, we propose a plug-and-play refinement network (P2RNet), which serves as a general optimization strategy to guide SAM in refining saliency maps by using hybrid prompts. The text prompt ensures reliable modality input, while the mask and box prompts enable precise salient object localization. Extensive experiments on three public datasets demonstrate that our method achieves state-of-the-art performance. Notably, HyPSAM has remarkable versatility, seamlessly integrating with different RGB-T SOD methods to achieve significant performance gains, thereby highlighting the potential of prompt engineering in this field. The code and results of our method are available at: https://github.com/milotic233/HyPSAM.
English Summary: This paper introduces HyPSAM, a hybrid prompt-driven model that enhances RGB-thermal salient object detection by leveraging SAM's zero-shot capabilities through dynamic fusion and refinement networks, achieving state-of-the-art performance across multiple datasets.

Authors:Yingquan Wang, Pingping Zhang, Chong Sun, Dong Wang, Huchuan Lu
Title: What Makes You Unique? Attribute Prompt Composition for Object Re-Identification
Abstract:
Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary to provide rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Concretely, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting. Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performance in terms of both discrimination and generalization. The source code is available at https://github.com/AWangYQ/APC.
English Summary: The proposed Attribute Prompt Composition (APC) framework leverages textual semantics and a dual-stream training strategy to enhance both discrimination and generalization in object re-identification, outperforming existing methods across diverse datasets.
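The abstract does not spell out how the slow stream retains VLM knowledge; one common realization of such fast/slow pairs is an exponential-moving-average copy, sketched below as an assumption rather than the paper's actual FSTS rule.

    import torch

    @torch.no_grad()
    def slow_update(slow_model, fast_model, momentum=0.999):
        # The slow stream tracks the gradient-trained fast stream by EMA,
        # changing slowly enough to preserve pre-trained representations.
        for ps, pf in zip(slow_model.parameters(), fast_model.parameters()):
            ps.mul_(momentum).add_(pf, alpha=1.0 - momentum)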

Authors:Nicolas Toussaint, Emanuele Colleoni, Ricardo Sanchez-Matilla, Joshua Sutcliffe, Vanessa Thompson, Muhammad Asad, Imanol Luengo, Danail Stoyanov
Title: Zero-shot Monocular Metric Depth for Endoscopic Images
Abstract:
Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to the sharp advancements in foundation models and in particular transformer based networks. As we start to see applications to the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at https://github.com/TouchSurgery/EndoSynth.
English: This paper introduces a comprehensive benchmark and a novel synthetic dataset, EndoSynth, to enhance depth estimation in endoscopic images, significantly improving model accuracy through fine-tuning and advancing clinical applications.

Authors:Yuanhuiyi Lyu, Chi Kit Wong, Chenfei Liao, Lutao Jiang, Xu Zheng, Zexin Lu, Linfeng Zhang, Xuming Hu
Title: Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation
Abstract:
Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce "Image Editing" as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: https://github.com/QC-LY/UiG
English Summary: The proposed Understanding-in-Generation (UiG) framework integrates image editing to leverage unified models' strong comprehension capabilities, thereby enhancing text-to-image generation performance and achieving significant improvements over existing methods.

Authors:Neel P. Bhatt, Yunhao Yang, Rohan Siva, Pranay Samineni, Daniel Milan, Zhangyang Wang, Ufuk Topcu
Title: VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation
Abstract:
Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: https://vln-zero.github.io/.

Authors:Zixin Zhu, Haoxiang Li, Xuelu Feng, He Wu, Chunming Qiao, Junsong Yuan
Title: GeoRemover: Removing Objects and Their Causal Visual Artifacts
Abstract:
Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object's geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.
English Summary: This paper introduces a geometry-aware two-stage framework that first removes objects from geometric data using strict mask alignment and then renders photorealistic images, effectively eliminating both target objects and their causal visual artifacts while maintaining structural integrity.

Authors:Mohammad Hosseini, Maryam M. Shanechi
Title: Dynamical Modeling of Behaviorally Relevant Spatiotemporal Patterns in Neural Imaging Data
Abstract:
High-dimensional imaging of neural activity, such as widefield calcium and functional ultrasound imaging, provides a rich source of information for understanding the relationship between brain activity and behavior. Accurately modeling neural dynamics in these modalities is crucial for understanding this relationship but is hindered by their high dimensionality, complex spatiotemporal dependencies, and prevalent behaviorally irrelevant dynamics. Existing dynamical models often employ preprocessing steps to obtain low-dimensional representations from neural image modalities. However, this process can discard behaviorally relevant information and miss spatiotemporal structure. We propose SBIND, a novel data-driven deep learning framework to model spatiotemporal dependencies in neural images and disentangle their behaviorally relevant dynamics from other neural dynamics. We validate SBIND on widefield imaging datasets, and show its extension to functional ultrasound imaging, a recent modality whose dynamical modeling has largely remained unexplored. We find that our model effectively identifies both local and long-range spatial dependencies across the brain while also dissociating behaviorally relevant neural dynamics. Doing so, SBIND outperforms existing models in neural-behavioral prediction. Overall, SBIND provides a versatile tool for investigating the neural mechanisms underlying behavior using imaging modalities.
English: SBIND is a novel deep learning framework that effectively models spatiotemporal dependencies in neural imaging data to disentangle behaviorally relevant dynamics, outperforming existing models in neural-behavioral prediction across modalities like widefield and functional ultrasound imaging.

Authors:Maximilian Fehrentz, Alexander Winkler, Thomas Heiliger, Nazim Haouchine, Christian Heiliger, Nassir Navab
Title: BridgeSplat: Bidirectionally Coupled CT and Non-Rigid Gaussian Splatting for Deformable Intraoperative Surgical Navigation
Abstract:
We introduce BridgeSplat, a novel approach for deformable surgical navigation that couples intraoperative 3D reconstruction with preoperative CT data to bridge the gap between surgical video and volumetric patient data. Our method rigs 3D Gaussians to a CT mesh, enabling joint optimization of Gaussian parameters and mesh deformation through photometric supervision. By parametrizing each Gaussian relative to its parent mesh triangle, we enforce alignment between Gaussians and mesh and obtain deformations that can be propagated back to update the CT. We demonstrate BridgeSplat's effectiveness on visceral pig surgeries and synthetic data of a human liver under simulation, showing sensible deformations of the preoperative CT on monocular RGB data. Code, data, and additional resources can be found at https://maxfehrentz.github.io/ct-informed-splatting/ .
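Parametrizing each Gaussian relative to its parent triangle, as the abstract describes, can be sketched with barycentric coordinates plus a normal offset; the tensor layouts below are assumptions for illustration.

    import torch

    def gaussian_positions(verts, faces, tri_idx, bary, offset):
        # verts: (V, 3); faces: (F, 3) vertex ids; tri_idx: (G,) parent triangle
        # per Gaussian; bary: (G, 3) barycentric coords; offset: (G, 1) along normal.
        tv = verts[faces[tri_idx]]                        # (G, 3, 3) triangle corners
        pos = (bary.unsqueeze(-1) * tv).sum(dim=1)        # point on the face
        n = torch.cross(tv[:, 1] - tv[:, 0], tv[:, 2] - tv[:, 0], dim=1)
        n = n / n.norm(dim=1, keepdim=True).clamp_min(1e-8)
        return pos + offset * n                           # deforms with the mesh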

Authors:Md Mostafijur Rahman, Radu Marculescu
Title: MK-UNet: Multi-kernel Lightweight CNN for Medical Image Segmentation
Abstract:
In this paper, we introduce MK-UNet, a paradigm shift towards ultra-lightweight, multi-kernel U-shaped CNNs tailored for medical image segmentation. Central to MK-UNet is the multi-kernel depth-wise convolution block (MKDC) we design to adeptly process images through multiple kernels, while capturing complex multi-resolution spatial relationships. MK-UNet also emphasizes salient image features through sophisticated attention mechanisms, including channel, spatial, and grouped gated attention. Our MK-UNet network, with a modest computational footprint of only 0.316M parameters and 0.314G FLOPs, represents not only a remarkably lightweight, but also significantly improved segmentation solution that provides higher accuracy over state-of-the-art (SOTA) methods across six binary medical imaging benchmarks. Specifically, MK-UNet outperforms TransUNet in DICE score with nearly 333× and 123× fewer parameters and FLOPs, respectively. Similarly, when compared against UNeXt, MK-UNet exhibits superior segmentation performance, improving the DICE score by up to 6.7% while operating with 4.7× fewer parameters. Our MK-UNet also outperforms other recent lightweight networks, such as MedT, CMUNeXt, EGE-UNet, and Rolling-UNet, with much lower computational resources. This leap in performance, coupled with drastic computational gains, positions MK-UNet as an unparalleled solution for real-time, high-fidelity medical diagnostics in resource-limited settings, such as point-of-care devices. Our implementation is available at https://github.com/SLDGroup/MK-UNet.
English: MK-UNet introduces an ultra-lightweight multi-kernel U-shaped CNN for medical image segmentation, achieving higher accuracy with significantly fewer parameters and computational costs than state-of-the-art methods, making it ideal for resource-limited real-time diagnostics.
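A minimal sketch of a multi-kernel depth-wise block in the spirit of MKDC follows; the exact kernel set and fusion used in MK-UNet are not given here, so these choices are illustrative assumptions.

    import torch.nn as nn

    class MKDC(nn.Module):
        """Parallel depth-wise convolutions at several kernel sizes."""
        def __init__(self, ch, kernels=(1, 3, 5)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch) for k in kernels
            )
            self.mix = nn.Conv2d(ch, ch, 1)   # point-wise channel mixing

        def forward(self, x):
            return self.mix(sum(b(x) for b in self.branches))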

Authors:Binhua Huang, Wendong Yao, Shaowu Chen, Guoxin Wang, Qingyuan Wang, Soumyabrata Dev
Title: MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition
Abstract:
We introduce MoCrop, a motion-aware adaptive cropping module for efficient video action recognition in the compressed domain. MoCrop uses motion vectors that are available in H.264 video to locate motion-dense regions and produces a single clip-level crop that is applied to all I-frames at inference. The module is training free, adds no parameters, and can be plugged into diverse backbones. A lightweight pipeline that includes denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC) via a motion-density submatrix search yields robust crops with negligible overhead. On UCF101, MoCrop improves accuracy or reduces compute. With ResNet-50, it delivers +3.5% Top-1 accuracy at equal FLOPs (attention setting), or +2.4% Top-1 accuracy with 26.5% fewer FLOPs (efficiency setting). Applied to CoViAR, it reaches 89.2% Top-1 accuracy at the original cost and 88.5% Top-1 accuracy while reducing compute from 11.6 to 8.5 GFLOPs. Consistent gains on MobileNet-V3, EfficientNet-B1, and Swin-B indicate strong generality and make MoCrop practical for real-time deployment in the compressed domain. Our code and models are available at https://github.com/microa/MoCrop.
MoCrop is a training-free, motion-aware adaptive cropping module that enhances video action recognition efficiency by using motion vectors from compressed videos to focus on key regions, improving accuracy and reducing computational costs across various models.

Authors:Zitian Zhang, Joshua Urban Davis, Jeanne Phuong Anh Vu, Jiangtao Kuang, Jean-François Lalonde
Title: Improving the color accuracy of lighting estimation models
Abstract:
Advances in high dynamic range (HDR) lighting estimation from a single image have opened new possibilities for augmented reality (AR) applications. Predicting complex lighting environments from a single input image allows for the realistic rendering and compositing of virtual objects. In this work, we investigate the color robustness of such methods -- an often overlooked yet critical factor for achieving visual realism. While most evaluations conflate color with other lighting attributes (e.g., intensity, direction), we isolate color as the primary variable of interest. Rather than introducing a new lighting estimation algorithm, we explore whether simple adaptation techniques can enhance the color accuracy of existing models. Using a novel HDR dataset featuring diverse lighting colors, we systematically evaluate several adaptation strategies. Our results show that preprocessing the input image with a pre-trained white balance network improves color robustness, outperforming other strategies across all tested scenarios. Notably, this approach requires no retraining of the lighting estimation model. We further validate the generality of this finding by applying the technique to three state-of-the-art lighting estimation methods from recent literature.
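The paper's winning strategy preprocesses the input with a pre-trained white-balance network; as a self-contained stand-in, the sketch below applies a classic gray-world correction, which is only a placeholder for that learned model.

    import numpy as np

    def gray_world_wb(img):
        # img: (H, W, 3) float RGB in [0, 1]; scale channels so their means match.
        means = img.reshape(-1, 3).mean(axis=0)
        gains = means.mean() / np.clip(means, 1e-6, None)
        return np.clip(img * gains, 0.0, 1.0)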

Authors:Binhua Huang, Ni Wang, Wendong Yao, Soumyabrata Dev
Title: MVP: Motion Vector Propagation for Zero-Shot Video Object Detection
Abstract:
Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.
English: This paper introduces MVP, a training-free method that reduces computational costs by applying an open-vocabulary detector only to keyframes and propagating detections via compressed-domain motion vectors, achieving competitive accuracy while maintaining zero-shot video coverage.
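Propagating keyframe detections with motion vectors reduces, in the simplest case, to a robust per-box translation update; this sketch takes the median over vectors inside the box and omits the abstract's uniform-scale and area-growth logic, and the dense (H, W, 2) field layout is an assumption.

    import numpy as np

    def propagate_box(box, mv_field):
        # box: (x1, y1, x2, y2); mv_field: (H, W, 2) motion vectors in pixels.
        x1, y1, x2, y2 = (int(v) for v in box)
        region = mv_field[y1:y2, x1:x2].reshape(-1, 2)
        if region.size == 0:
            return box
        dx, dy = np.median(region, axis=0)     # robust to outlier vectors
        return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)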

Authors:Mehrdad Moradi, Shengzhe Chen, Hao Yan, Kamran Paynabar
Title: A Single Image Is All You Need: Zero-Shot Anomaly Localization Without Training Data
Abstract:
Anomaly detection in images is typically addressed by learning from collections of training data or relying on reference samples. In many real-world scenarios, however, such training data may be unavailable, and only the test image itself is provided. We address this zero-shot setting by proposing a single-image anomaly localization method that leverages the inductive bias of convolutional neural networks, inspired by Deep Image Prior (DIP). Our method is named Single Shot Decomposition Network (SSDnet). Our key assumption is that natural images often exhibit unified textures and patterns, and that anomalies manifest as localized deviations from these repetitive or stochastic patterns. To learn the deep image prior, we design a patch-based training framework where the input image is fed directly into the network for self-reconstruction, rather than mapping random noise to the image as done in DIP. To avoid the model simply learning an identity mapping, we apply masking, patch shuffling, and small Gaussian noise. In addition, we use a perceptual loss based on inner-product similarity to capture structure beyond pixel fidelity. Our approach needs no external training data, labels, or references, and remains robust in the presence of noise or missing pixels. SSDnet achieves 0.99 AUROC and 0.60 AUPRC on MVTec-AD and 0.98 AUROC and 0.67 AUPRC on the fabric dataset, outperforming state-of-the-art methods. The implementation code will be released at https://github.com/mehrdadmoradi124/SSDnet
English: SSDnet is a zero-shot anomaly localization method that uses a patch-based self-reconstruction network with masking and perceptual loss to detect anomalies without any training data, achieving state-of-the-art performance on benchmark datasets.
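The self-reconstruction training described above fits in a few lines; the tiny network, mask ratio, noise level, and iteration count below are illustrative assumptions, not SSDnet's configuration.

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    img = torch.rand(1, 3, 256, 256)           # the single test image

    for step in range(200):
        mask = (torch.rand_like(img[:, :1]) > 0.25).float()   # random masking
        noisy = img * mask + 0.05 * torch.randn_like(img)      # small Gaussian noise
        loss = ((net(noisy) - img) ** 2).mean()                # self-reconstruction
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                       # anomalies reconstruct poorly
        anomaly_map = ((net(img) - img) ** 2).mean(dim=1)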

Authors:Oussema Dhaouadi, Riccardo Marin, Johannes Meier, Jacques Kaiser, Daniel Cremers
Title: OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata
Abstract:
Accurate visual localization from aerial views is a fundamental problem with applications in mapping, large-area inspection, and search-and-rescue operations. In many scenarios, these systems require high-precision localization while operating with limited resources (e.g., no internet connection or GNSS/GPS support), making large image databases or heavy 3D models impractical. Surprisingly, little attention has been given to leveraging orthographic geodata as an alternative paradigm, which is lightweight and increasingly available through free releases by governmental authorities (e.g., the European Union). To fill this gap, we propose OrthoLoC, the first large-scale dataset comprising 16,425 UAV images from Germany and the United States with multiple modalities. The dataset addresses domain shifts between UAV imagery and geospatial data. Its paired structure enables fair benchmarking of existing solutions by decoupling image retrieval from feature matching, allowing isolated evaluation of localization and calibration performance. Through comprehensive evaluation, we examine the impact of domain shifts, data resolutions, and covisibility on localization accuracy. Finally, we introduce a refinement technique called AdHoP, which can be integrated with any feature matcher, improving matching by up to 95% and reducing translation error by up to 63%. The dataset and code are available at: https://deepscenario.github.io/OrthoLoC.

Authors:Yixin Zhang, Ryan Chamberlain, Lawrence Ngo, Kevin Kramer, Maciej A. Mazurowski
Title: Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model
Abstract:
In this study, we curated a densely annotated in-house dataset comprising 490 CTPA scans. Using this dataset, we systematically evaluated nine widely used segmentation architectures from both the CNN and Vision Transformer (ViT) families, initialized with either pretrained or random weights, under a unified testing framework as a performance audit. Our study leads to several important observations: (1) 3D U-Net with a ResNet encoder remains a highly effective architecture for PE segmentation; (2) 3D models are particularly well-suited to this task given the morphological characteristics of emboli; (3) CNN-based models generally yield superior performance compared to their ViT-based counterparts in PE segmentation; (4) classification-based pretraining, even on large PE datasets, can adversely impact segmentation performance compared to training from scratch, suggesting that PE classification and segmentation may rely on different sets of discriminative features; (5) different model architectures show a highly consistent pattern of segmentation performance when trained on the same data; and (6) while central and large emboli can be segmented with satisfactory accuracy, distal emboli remain challenging due to both task complexity and the scarcity of high-quality datasets. Besides these findings, our best-performing model achieves a mean Dice score of 0.7131 for segmentation. It detects 181 emboli with 49 false positives and 28 false negatives from 60 in-house testing scans. Its generalizability is further validated on public datasets.
English: This study evaluated nine segmentation architectures using a densely annotated dataset of 490 CTPA scans, finding that 3D U-Net with a ResNet encoder performs best for pulmonary embolism segmentation, with CNN-based models generally outperforming Vision Transformers and pretraining potentially harming performance.
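For reference, the reported Dice score is the standard overlap measure; a minimal sketch:

    import torch

    def dice(pred, target, eps=1e-6):
        # pred, target: boolean masks of identical shape.
        inter = (pred & target).sum().float()
        return (2 * inter + eps) / (pred.sum() + target.sum() + eps)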

Authors:Yi Gu, Kuniaki Saito, Jiaxin Ma
Title: Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction
Abstract:
As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modality dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.
English Summary: This study introduces a multimodal learning framework that enhances robustness to missing data and modality imbalance through improved dropout techniques and contrastive learning, achieving superior performance in clinical applications.
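The learnable-modality-token idea admits a compact sketch: when a modality is missing, or dropped during training, its embedding is replaced by a learned token before fusion. The dimensions, drop rate, and mean fusion below are assumptions.

    import torch
    import torch.nn as nn

    class TokenFusion(nn.Module):
        def __init__(self, dim=256, n_modalities=2, p_drop=0.3):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(n_modalities, dim) * 0.02)
            self.p_drop = p_drop

        def forward(self, feats, batch_size):   # feats: list of (B, dim) or None
            fused = []
            for i, f in enumerate(feats):
                drop = self.training and torch.rand(()) < self.p_drop
                if f is None or drop:            # substitute the learned token
                    f = self.tokens[i].unsqueeze(0).expand(batch_size, -1)
                fused.append(f)
            return torch.stack(fused, dim=1).mean(dim=1)   # simple mean fusion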

Authors:Xiuding Cai, Yaoyao Zhu, Linjie Fu, Dong Miao, Yu Yao
Title: Self Identity Mapping
Abstract:
Regularization is essential in deep learning to enhance generalization and mitigate overfitting. However, conventional techniques often rely on heuristics, making them less reliable or effective across diverse settings. We propose Self Identity Mapping (SIM), a simple yet effective, data-intrinsic regularization framework that leverages an inverse mapping mechanism to enhance representation learning. By reconstructing the input from its transformed output, SIM reduces information loss during forward propagation and facilitates smoother gradient flow. To address computational inefficiencies, we instantiate SIM as ρSIM by incorporating patch-level feature sampling and a projection-based method to reconstruct latent features, effectively lowering complexity. As a model-agnostic, task-agnostic regularizer, SIM can be seamlessly integrated as a plug-and-play module, making it applicable to different network architectures and tasks. We extensively evaluate ρSIM across three tasks: image classification, few-shot prompt learning, and domain generalization. Experimental results show consistent improvements over baseline methods, highlighting ρSIM's ability to enhance representation learning across various tasks. We also demonstrate that ρSIM is orthogonal to existing regularization methods, boosting their effectiveness. Moreover, our results confirm that ρSIM effectively preserves semantic information and enhances performance in dense-to-dense tasks, such as semantic segmentation and image translation, as well as in non-visual domains including audio classification and time series anomaly detection. The code is publicly available at https://github.com/XiudingCai/SIM-pytorch.
English: The authors propose Self Identity Mapping (SIM), a data-intrinsic regularization framework that uses inverse mapping to reconstruct inputs from transformed outputs, improving representation learning and gradient flow while being computationally efficient and applicable across various tasks and architectures.
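The inverse-mapping idea can be sketched as a light head that reconstructs the input from intermediate features, with the reconstruction error added to the task loss; this dense variant simplifies ρSIM's patch-level sampling, and the layer shapes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    proj = nn.Conv2d(128, 3, kernel_size=1)     # assumed feature dim 128 -> RGB

    def sim_loss(x, h):
        # x: (B, 3, H, W) input; h: (B, 128, H', W') intermediate features.
        h_up = F.interpolate(h, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
        x_hat = proj(h_up)                      # inverse mapping back to input space
        return ((x_hat - x) ** 2).mean()        # added to the main task loss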

Authors:Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
Title: MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Abstract:
Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
English: MiniCPM-V 4.5 is an efficient 8B parameter multimodal model that surpasses leading proprietary and larger open-source models in performance while significantly reducing GPU memory and inference time.

Authors:Julian Kaltheuner, Alexander Oebel, Hannah Droege, Patrick Stotko, Reinhard Klein
Title: Preconditioned Deformation Grids
Abstract:
Dynamic surface reconstruction of objects from point cloud sequences is a challenging field in computer graphics. Existing approaches either require multiple regularization terms or extensive training data which, however, lead to compromises in reconstruction accuracy as well as over-smoothing or poor generalization to unseen objects and motions. To address these limitations, we introduce Preconditioned Deformation Grids, a novel technique for estimating coherent deformation fields directly from unstructured point cloud sequences without requiring or forming explicit correspondences. Key to our approach is the use of multi-resolution voxel grids that capture the overall motion at varying spatial scales, enabling a more flexible deformation representation. In conjunction with incorporating grid-based Sobolev preconditioning into gradient-based optimization, we show that applying a Chamfer loss between the input point clouds as well as to an evolving template mesh is sufficient to obtain accurate deformations. To ensure temporal consistency along the object surface, we include a weak isometry loss on mesh edges which complements the main objective without constraining deformation fidelity. Extensive evaluations demonstrate that our method achieves superior results, particularly for long sequences, compared to state-of-the-art techniques.
English Summary: This paper introduces Preconditioned Deformation Grids, a novel technique that reconstructs dynamic surfaces from point cloud sequences using multi-resolution voxel grids and grid-based Sobolev preconditioning, achieving superior accuracy and temporal consistency without requiring explicit correspondences or extensive training data.
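The Chamfer objective named above is small enough to state exactly; this is the standard symmetric nearest-neighbor form.

    import torch

    def chamfer(a, b):
        # a: (N, 3), b: (M, 3) point sets.
        d = torch.cdist(a, b)                   # (N, M) pairwise distances
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()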

Authors:Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim
Title: Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers
Abstract:
Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.

Authors:Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
Title: UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Abstract:
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
Summary: UniPixel bridges the gap in pixel-level understanding by integrating visual prompts with mask generation for fine-grained reasoning, validated across ten benchmarks including a novel PixelQA task.

Authors:Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman
Title: ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation
Abstract:
Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.

Authors:Jiahe Li, Jiawei Zhang, Youmin Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu
Title: GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction
Abstract:
Reconstructing accurate surfaces with radiance fields has achieved remarkable progress in recent years. However, prevailing approaches, primarily based on Gaussian Splatting, are increasingly constrained by representational bottlenecks. In this paper, we introduce GeoSVR, an explicit voxel-based framework that explores and extends the under-investigated potential of sparse voxels for achieving accurate, detailed, and complete surface reconstruction. As strengths, sparse voxels support preserving the coverage completeness and geometric clarity, while corresponding challenges also arise from absent scene constraints and locality in surface refinement. To ensure correct scene convergence, we first propose a Voxel-Uncertainty Depth Constraint that maximizes the effect of monocular depth cues while presenting a voxel-oriented uncertainty to avoid quality degradation, enabling effective and robust scene constraints yet preserving highly accurate geometries. Subsequently, Sparse Voxel Surface Regularization is designed to enhance geometric consistency for tiny voxels and facilitate the voxel-based formation of sharp and accurate surfaces. Extensive experiments demonstrate our superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency. Code is available at https://github.com/Fictionarry/GeoSVR.
Summary: GeoSVR introduces an explicit voxel-based framework that overcomes representational bottlenecks in radiance-field surface reconstruction by combining sparse voxels with a voxel-uncertainty depth constraint and sparse-voxel surface regularization, achieving superior geometric accuracy, detail preservation, and completeness while maintaining high efficiency.

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
Title: TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Abstract:
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
Summary: TempSamp-R1 is a reinforcement fine-tuning framework that improves multimodal large language models on video temporal grounding by combining off-policy ground-truth supervision with a non-linear soft advantage computation, achieving state-of-the-art results on benchmark datasets.

Authors:Kai Li, Xingxing Weng, Yupeng Deng, Yu Meng, Chao Pang, Gui-Song Xia, Xiangyu Zhao
Title: DragOSM: Extract Building Roofs and Footprints from Aerial Images by Aligning Historical Labels
Abstract:
Extracting polygonal roofs and footprints from remote sensing images is critical for large-scale urban analysis. Most existing methods rely on segmentation-based models that assume clear semantic boundaries of roofs, but these approaches struggle in off-nadir images, where the roof and footprint are significantly displaced, and facade pixels are fused with the roof boundary. With the increasing availability of open vector map annotations, e.g., OpenStreetMap, utilizing historical labels for off-nadir image annotation has become viable because remote sensing images are georeferenced once captured. However, these historical labels commonly suffer from significant positional discrepancies with new images and only have one annotation (roof or footprint), which fails to describe the correct structures of a building. To address these discrepancies, we first introduce the concept of an alignment token, which encodes the correction vector to guide the label correction. Based on this concept, we then propose Drag OpenStreetMap Labels (DragOSM), a novel model designed to align dislocated historical labels with roofs and footprints. Specifically, DragOSM formulates the label alignment as an interactive denoising process, modeling the positional discrepancy as a Gaussian distribution. During training, it learns to correct these errors by simulating misalignment with random Gaussian perturbations; during inference, it iteratively refines the positions of input labels. To validate our method, we further present a new dataset, Repairing Buildings in OSM (ReBO), comprising 179,265 buildings with both OpenStreetMap and manually corrected annotations across 5,473 images from 41 cities. Experimental results on ReBO demonstrate the effectiveness of DragOSM. Code, dataset, and trained models are publicly available at https://github.com/likaiucas/DragOSM.git.
Summary: DragOSM corrects positional discrepancies in historical map labels by treating roof and footprint alignment as an interactive denoising process, validated on the newly introduced ReBO dataset.
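A minimal sketch of the interactive-denoising view described above: misalignment is simulated with Gaussian perturbations at training time, and at inference a predicted correction vector is applied iteratively. The predictor below is a stand-in; function names, the step count, and the noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_misalignment(polygon, sigma=5.0):
    """Training-time corruption: shift an (N, 2) polygon by a random Gaussian offset."""
    return polygon + rng.normal(scale=sigma, size=(1, 2))

def refine(polygon, predict_correction, steps=4):
    """Inference: iteratively apply the predicted correction vector to the label."""
    for _ in range(steps):
        polygon = polygon + predict_correction(polygon)
    return polygon

# Toy usage with a dummy predictor that pulls the label toward a known target centroid.
target = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
noisy = simulate_misalignment(target, sigma=8.0)
dummy_predictor = lambda poly: 0.5 * (target.mean(0) - poly.mean(0))
print(np.abs(refine(noisy, dummy_predictor) - target).mean())
```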

Authors:Romain Thoreau, Jessie Levillain, Dawa Derksen
Title: Can multimodal representation learning by alignment preserve modality-specific information?
Abstract:
Combining multimodal data is a key issue in a wide range of machine learning tasks, including many remote sensing problems. In Earth observation, early multimodal data fusion methods were based on specific neural network architectures and supervised learning. Ever since, the scarcity of labeled data has motivated self-supervised learning techniques. State-of-the-art multimodal representation learning techniques leverage the spatial alignment between satellite data from different modalities acquired over the same geographic area in order to foster a semantic alignment in the latent space. In this paper, we investigate how these methods can preserve task-relevant information that is not shared across modalities. First, we show, under simplifying assumptions, when alignment strategies fundamentally lead to an information loss. Then, we support our theoretical insight through numerical experiments in more realistic settings. With this theoretical and empirical evidence, we hope to support new developments in contrastive learning for the combination of multimodal satellite data. Our code and data are publicly available at https://github.com/Romain3Ch216/alg_maclean_25.
Summary: This paper examines whether multimodal representation learning by alignment in Earth observation can preserve task-relevant information that is not shared across modalities, presenting theoretical and empirical evidence that alignment strategies can cause information loss.

Authors:Yuanhan Wang, Yifei Chen, Shuo Jiang, Wenjing Yu, Mingxuan Liu, Beining Wu, Jinying Zong, Feiwei Qin, Changmiao Wang, Qiyuan Tian
Title: SmaRT: Style-Modulated Robust Test-Time Adaptation for Cross-Domain Brain Tumor Segmentation in MRI
Abstract:
Reliable brain tumor segmentation in MRI is indispensable for treatment planning and outcome monitoring, yet models trained on curated benchmarks often fail under domain shifts arising from scanner and protocol variability as well as population heterogeneity. Such gaps are especially severe in low-resource and pediatric cohorts, where conventional test-time or source-free adaptation strategies often suffer from instability and structural inconsistency. We propose SmaRT, a style-modulated robust test-time adaptation framework that enables source-free cross-domain generalization. SmaRT integrates style-aware augmentation to mitigate appearance discrepancies, a dual-branch momentum strategy for stable pseudo-label refinement, and structural priors enforcing consistency, integrity, and connectivity. This synergy ensures both adaptation stability and anatomical fidelity under extreme domain shifts. Extensive evaluations on sub-Saharan Africa and pediatric glioma datasets show that SmaRT consistently outperforms state-of-the-art methods, with notable gains in Dice accuracy and boundary precision. Overall, SmaRT bridges the gap between algorithmic advances and equitable clinical applicability, supporting robust deployment of MRI-based neuro-oncology tools in diverse clinical environments. Our source code is available at https://github.com/baiyou1234/SmaRT.
Summary: SmaRT is a style-modulated test-time adaptation framework that improves brain tumor segmentation under domain shift by combining style-aware augmentation, momentum-based pseudo-label refinement, and structural consistency priors, performing strongly on low-resource and pediatric datasets.
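For intuition, here is a minimal sketch of two ingredients named in the abstract, a momentum (EMA) teacher update and confidence-thresholded pseudo-labels, written for a toy 3D segmentation head; the hyperparameters and module names are assumptions, and the structural priors are omitted.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Momentum (EMA) update of the teacher branch from the student branch."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

@torch.no_grad()
def pseudo_labels(logits, threshold=0.9):
    """Per-voxel pseudo-labels, keeping only confident predictions."""
    probs = logits.softmax(dim=1)
    conf, labels = probs.max(dim=1)
    labels[conf < threshold] = -1   # -1 marks voxels ignored by the adaptation loss
    return labels

# Toy usage with tiny segmentation heads.
student = torch.nn.Conv3d(1, 2, kernel_size=1)
teacher = torch.nn.Conv3d(1, 2, kernel_size=1)
x = torch.randn(1, 1, 8, 8, 8)
labels = pseudo_labels(teacher(x))
ema_update(teacher, student)
print(labels.shape)
```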

Authors:Geewook Kim, Minjoon Seo
Title: Does Audio Matter for Modern Video-LLMs and Their Benchmarks?
Abstract:
Modern multimodal large language models often claim "video understanding," yet most evaluations use muted videos or simply discard audio. We ask a direct question: how much does audio actually matter for contemporary Video-LLMs and the benchmarks that certify them? We audit widely used suites and observe that many items are even solvable from a single frame, rendering audio largely redundant. Building on LLaVA-OneVision architecture, we attach a speech/audio encoder (e.g., Whisper) and analyze when audio helps, while addressing audio token explosion with a lightweight Mamba-based state-space token compressor. We find that audio yields minimal gains on recent video benchmarks but is decisive on curated, audio-sensitive subsets. To enable faithful evaluation, we release AVQA-Hard and Music-AVQA-Hard, our model, and code. Our findings surface a growing gap between current academic practice and real-world expectations, and provide practical tools for scalable audio-visual Video-LLMs. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
Summary: Audio provides minimal gains for current Video-LLMs on standard benchmarks but is decisive on audio-sensitive subsets, motivating the released AVQA-Hard and Music-AVQA-Hard benchmarks and a scalable audio-visual model to close the gap between academic practice and real-world needs.

Authors:Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma
Title: ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
Abstract:
Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more challenging in Diffusion Transformers (DiTs), where the unsuitability of prior layer-selection heuristics makes effective guidance difficult. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. In detail, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
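A minimal sketch of the context-enrichment idea: rather than replacing features, the editing path attends over its own Key/Value pairs concatenated with those of the parallel reconstruction path. The shapes and single-head formulation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def enriched_attention(q_edit, k_edit, v_edit, k_recon, v_recon):
    """Self-attention for the editing path whose context is enriched with
    Key/Value pairs from the parallel reconstruction path (no hard replacement)."""
    k = torch.cat([k_edit, k_recon], dim=1)   # (B, 2*T, D)
    v = torch.cat([v_edit, v_recon], dim=1)
    attn = (q_edit @ k.transpose(1, 2)) / (q_edit.shape[-1] ** 0.5)
    return F.softmax(attn, dim=-1) @ v        # (B, T, D)

# Toy usage: one token sequence per path.
B, T, D = 1, 16, 32
out = enriched_attention(*(torch.randn(B, T, D) for _ in range(5)))
print(out.shape)  # torch.Size([1, 16, 32])
```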

Authors:Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bartłomiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer
Title: Accurate and Efficient Low-Rank Model Merging in Core Space
Abstract:
In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.
Summary: The Core Space framework merges LoRA-adapted models within a shared alignment basis, preserving low-rank efficiency while significantly improving accuracy on vision and language tasks at a fraction of the computational cost.
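The sketch below illustrates the general flavour of merging LoRA updates in a shared low-rank basis, built here via SVD of the stacked LoRA factors and averaging in that basis; this is an assumption-laden toy in the spirit of the abstract, not the paper's exact Core Space construction.

```python
import numpy as np

def merge_lora_in_shared_basis(As, Bs):
    """Merge LoRA updates dW_i = B_i @ A_i (B_i: d x r, A_i: r x k) by projecting
    them into a shared low-rank basis and averaging there (illustrative only)."""
    U = np.linalg.svd(np.concatenate(Bs, axis=1), full_matrices=False)[0]   # d x R
    Vt = np.linalg.svd(np.concatenate(As, axis=0), full_matrices=False)[2]  # R x k
    core = sum(U.T @ (B @ A) @ Vt.T for A, B in zip(As, Bs)) / len(As)      # R x R
    return U @ core @ Vt                                                    # merged dW, d x k

# Toy usage: two rank-4 LoRA updates of a 64x32 weight matrix.
rng = np.random.default_rng(0)
As = [rng.normal(size=(4, 32)) for _ in range(2)]
Bs = [rng.normal(size=(64, 4)) for _ in range(2)]
print(merge_lora_in_shared_basis(As, Bs).shape)   # (64, 32)
```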

Authors:Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
Title: I2VWM: Robust Watermarking for Image to Video Generation
Abstract:
The rapid progress of image-guided video generation (I2V) has raised concerns about its potential misuse in misinformation and fraud, underscoring the urgent need for effective digital watermarking. While existing watermarking methods demonstrate robustness within a single modality, they fail to trace source images in I2V settings. To address this gap, we introduce the concept of Robust Diffusion Distance, which measures the temporal persistence of watermark signals in generated videos. Building on this, we propose I2VWM, a cross-modal watermarking framework designed to enhance watermark robustness across time. I2VWM leverages a video-simulation noise layer during training and employs an optical-flow-based alignment module during inference. Experiments on both open-source and commercial I2V models demonstrate that I2VWM significantly improves robustness while maintaining imperceptibility, establishing a new paradigm for cross-modal watermarking in the era of generative video. Code is released at https://github.com/MrCrims/I2VWM-Robust-Watermarking-for-Image-to-Video-Generation.
Summary: I2VWM is a cross-modal watermarking framework that improves watermark robustness in image-to-video generation by measuring temporal persistence with the Robust Diffusion Distance and combining a video-simulation noise layer with optical-flow-based alignment.

Authors:Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
Title: Qwen3-Omni Technical Report
Abstract:
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
Summary: Qwen3-Omni is a single multimodal model that maintains state-of-the-art performance across text, image, audio, and video, excelling particularly on audio tasks where it surpasses strong closed-source models such as Gemini-2.5-Pro.

Authors:Bo Li, Yunkuo Lei, Tingting Bao, Yaxian Wang, Lingling Zhang, Jun Liu
Title: Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
Abstract:
Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box mechanisms struggle to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are third-generation neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model's neurodynamics to identify the constraints between the network parameters and the input signals. This analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. Unlike existing approaches to decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. The code is available at https://github.com/MorvanLi/ND-CNPFuse.
Summary: ND-CNPFuse is a neurodynamics-driven coupled neural P system that generates precise decision maps for multi-focus image fusion by constraining neuron firing and mapping source images into interpretable spike matrices, achieving state-of-the-art results without post-processing.
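Since the decision map is obtained directly by comparing spike counts, a toy version of that final step looks as follows; the spike matrices here are random stand-ins for the CNP-system outputs, and the fusion rule is a simplification.

```python
import numpy as np

def fuse_by_spike_count(img_a, img_b, spikes_a, spikes_b):
    """Decision map from comparing per-pixel spike counts; no post-processing."""
    decision = spikes_a >= spikes_b                     # True -> take pixel from image A
    return np.where(decision, img_a, img_b), decision

# Toy usage: spike counts are stand-ins for the coupled neural P system outputs.
rng = np.random.default_rng(0)
img_a, img_b = rng.random((2, 64, 64))
spikes_a, spikes_b = rng.integers(0, 20, size=(2, 64, 64))
fused, decision = fuse_by_spike_count(img_a, img_b, spikes_a, spikes_b)
print(fused.shape, decision.mean())
```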

Authors:Mariette Schönfeld, Wannes Meert, Hendrik Blockeel
Title: Tailored Transformation Invariance for Industrial Anomaly Detection
Abstract:
Industrial Anomaly Detection (IAD) is a subproblem within Computer Vision Anomaly Detection that has been receiving increasing amounts of attention due to its applicability to real-life scenarios. Recent research has focused on how to extract the most informative features, contrasting older kNN-based methods that use only pretrained features. These recent methods, however, are much more expensive to train and could complicate real-life application. Careful study of related work with regard to transformation invariance leads to the idea that popular benchmarks require robustness to only minor translations. With this idea we then formulate LWinNN, a local-window-based approach that creates a middle ground between kNN-based methods with either complete or no translation invariance. Our experiments demonstrate that this small change increases accuracy considerably, while simultaneously decreasing both train and test time. This teaches us two things: first, the gap between kNN-based approaches and more complex state-of-the-art methodology can still be narrowed by effective usage of the limited data available. Second, our assumption of requiring only limited translation invariance highlights potential areas of interest for future work and the need for more spatially diverse benchmarks, for which our method can hopefully serve as a new baseline. Our code can be found at https://github.com/marietteschonfeld/LWinNN.
Summary: LWinNN bridges kNN-based and more complex approaches by introducing limited translation invariance through local windows, improving accuracy while reducing train and test time, and highlighting the need for more spatially diverse benchmarks.
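A minimal sketch of a local-window nearest-neighbour score in the spirit described above: each test patch feature is compared only to training features at nearby grid positions, which gives limited translation invariance. The grid size, window radius, and use of a single training image are illustrative assumptions.

```python
import numpy as np

def local_window_knn_score(test_feats, train_feats, window=1):
    """test_feats, train_feats: (H, W, C) patch-feature grids (a real memory bank would
    stack several training images; one is kept here for brevity). The anomaly score at
    (i, j) is the distance to the nearest training feature within a (2*window+1)^2 window."""
    H, W, C = test_feats.shape
    scores = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - window), min(H, i + window + 1)
            j0, j1 = max(0, j - window), min(W, j + window + 1)
            neigh = train_feats[i0:i1, j0:j1].reshape(-1, C)
            scores[i, j] = np.linalg.norm(neigh - test_feats[i, j], axis=1).min()
    return scores

# Toy usage on random 14x14 grids of 64-dimensional features.
rng = np.random.default_rng(0)
print(local_window_knn_score(rng.normal(size=(14, 14, 64)),
                             rng.normal(size=(14, 14, 64))).shape)
```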

Authors:Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye
Title: SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
Abstract:
While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the limited spatial representation ability of 2D images. In this paper, we analyze the problem hindering VLMs' spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs through two key contributions: (1) the Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) a simple depth positional encoding method that strengthens VLMs' spatial awareness. The MSMU dataset covers massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
Summary: SD-VLM enhances vision-language models' quantitative 3D spatial reasoning through the MSMU dataset and a depth positional encoding, achieving state-of-the-art performance on spatial understanding benchmarks.
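The abstract does not spell out the depth positional encoding, so the sketch below assumes a sinusoidal encoding of each patch's depth that is added to its visual token; the dimensions, depth normalization, and frequency ladder are all illustrative assumptions.

```python
import math
import torch

def depth_positional_encoding(depth, dim=64, max_depth=20.0):
    """depth: (N,) per-patch depth in metres -> (N, dim) sinusoidal encoding,
    analogous to a standard positional encoding but driven by depth values."""
    freqs = torch.exp(torch.linspace(0.0, -4.0, dim // 2))   # geometric frequency ladder
    angles = (depth[:, None] / max_depth) * freqs[None, :] * 2 * math.pi
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

# Toy usage: add the encoding to 576 visual tokens of width 64.
tokens = torch.randn(576, 64)
depth = torch.rand(576) * 10.0
tokens = tokens + depth_positional_encoding(depth)
print(tokens.shape)  # torch.Size([576, 64])
```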

Authors:Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Taemin Lee
Title: Clothing agnostic Pre-inpainting Virtual Try-ON
Abstract:
With the development of deep learning, virtual try-on technology has gained important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa mitigates the texture distortion problem of diffusion-based models, but limitations remain: detection of lower garments is inaccurate and the silhouette of the original clothing persists in the synthesis results. To solve this problem, this study proposes CaP-VTON (Clothing agnostic Pre-inpainting Virtual Try-ON). CaP-VTON improves the naturalness and consistency of whole-body clothing synthesis by integrating multi-category masking based on Dress Code and skin inpainting based on Stable Diffusion. In particular, a skin-generation module is introduced to solve the skin restoration problem that occurs when long-sleeved images are converted into short-sleeved or sleeveless ones, and high-quality restoration is implemented considering the human body posture and color. As a result, CaP-VTON records 92.5% short-sleeved synthesis accuracy, 15.4% better than Leffa, and consistently reproduces the style and shape of the reference clothing in visual evaluation. These components are model-agnostic and applicable to various diffusion-based virtual try-on systems, and can contribute to applications that require high-precision virtual dressing, such as e-commerce, custom styling, and avatar creation.
Summary: CaP-VTON is a virtual try-on system that improves full-body clothing synthesis by integrating multi-category masking and skin inpainting, achieving better accuracy and visual consistency than previous methods.

Authors:Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Title: Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Abstract:
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
Summary: The proposed training-free token eviction policy bounds KV memory growth in streaming visual geometry transformers by discarding redundant tokens, preserving accuracy while drastically reducing memory across video depth estimation, 3D reconstruction, and pose estimation tasks.
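A minimal sketch of a training-free eviction step: when the streaming KV cache exceeds a budget, the tokens that received the least attention are dropped and temporal order is preserved. The attention-based score is an assumed proxy for token informativeness, not the paper's exact criterion.

```python
import torch

def evict_kv(keys, values, attn_received, budget):
    """keys/values: (T, D) cached tokens; attn_received: (T,) cumulative attention each
    cached token received from recent queries. Keep only the `budget` highest-scoring
    tokens so memory stays bounded during streaming inference."""
    if keys.shape[0] <= budget:
        return keys, values, attn_received
    keep = attn_received.topk(budget).indices.sort().values   # preserve temporal order
    return keys[keep], values[keep], attn_received[keep]

# Toy usage: a 1,000-token cache clipped to a 256-token budget.
T, D = 1000, 64
k, v, score = torch.randn(T, D), torch.randn(T, D), torch.rand(T)
k, v, score = evict_kv(k, v, score, budget=256)
print(k.shape)  # torch.Size([256, 64])
```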

Authors:Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He
Title: OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Abstract:
Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.

Authors:Aiming Zhang, Tianyuan Yu, Liang Bai, Jun Tang, Yanming Guo, Yirun Ruan, Yun Zhou, Zhihe Lu
Title: COLA: Context-aware Language-driven Test-time Adaptation
Abstract:
Test-time adaptation (TTA) has gained increasing popularity due to its efficacy in addressing the "distribution shift" issue while simultaneously protecting data privacy. However, most prior methods assume that a paired source-domain model and a target domain sharing the same label space coexist, heavily limiting their applicability. In this paper, we investigate a more general source model capable of adaptation to multiple target domains without needing shared labels. This is achieved by using a pre-trained vision-language model (VLM), e.g., CLIP, that can recognize images through matching with class descriptions. While the zero-shot performance of VLMs is impressive, they struggle to effectively capture the distinctive attributes of a target domain. To that end, we propose a novel method, Context-aware Language-driven TTA (COLA). The proposed method incorporates a lightweight context-aware module that consists of three key components: a task-aware adapter, a context-aware unit, and a residual connection unit for exploring task-specific knowledge, domain-specific knowledge from the VLM, and prior knowledge of the VLM, respectively. It is worth noting that the context-aware module can be seamlessly integrated into a frozen VLM, ensuring both minimal effort and parameter efficiency. Additionally, we introduce a Class-Balanced Pseudo-labeling (CBPL) strategy to mitigate the adverse effects caused by class imbalance. We demonstrate the effectiveness of our method not only in TTA scenarios but also in class generalisation tasks. The source code is available at https://github.com/NUDT-Bai-Group/COLA-TTA.
Summary: COLA adapts a frozen vision-language model to multiple target domains without shared labels by adding a lightweight context-aware module and class-balanced pseudo-labeling, improving both test-time adaptation and class generalization.
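As an illustration of class-balanced pseudo-labeling, the sketch below keeps the same number of highest-confidence samples per predicted class so head classes cannot dominate adaptation; the per-class quota and selection rule are assumptions, not COLA's exact CBPL procedure.

```python
import numpy as np

def class_balanced_pseudo_labels(probs, per_class=16):
    """probs: (N, C) softmax outputs. Returns indices and pseudo-labels of an equally
    sized, highest-confidence subset for every predicted class."""
    conf, preds = probs.max(axis=1), probs.argmax(axis=1)
    keep = []
    for c in range(probs.shape[1]):
        idx = np.where(preds == c)[0]
        idx = idx[np.argsort(conf[idx])[::-1][:per_class]]   # top-confidence samples of class c
        keep.append(idx)
    keep = np.concatenate(keep)
    return keep, preds[keep]

# Toy usage with 1,000 samples and 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
idx, labels = class_balanced_pseudo_labels(probs)
print(idx.shape, np.bincount(labels, minlength=10))
```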

Authors:Florinel Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu
Title: PRNU-Bench: A Novel Benchmark and Model for PRNU-Based Camera Identification
Abstract:
We propose a novel benchmark for camera identification via Photo Response Non-Uniformity (PRNU) estimation. The benchmark comprises 13K photos taken with 120+ cameras, where the training and test photos are taken in different scenarios, enabling ``in-the-wild'' evaluation. In addition, we propose a novel PRNU-based camera identification model that employs a hybrid architecture, comprising a denoising autoencoder to estimate the PRNU signal and a convolutional network that can perform 1:N verification of camera devices. Instead of using a conventional approach based on contrastive learning, our method takes the Hadamard product between reference and query PRNU signals as input. This novel design leads to significantly better results compared with state-of-the-art models based on denoising autoencoders and contrastive learning. We release our dataset and code at: https://github.com/CroitoruAlin/PRNU-Bench.
Summary: PRNU-Bench is a new camera identification benchmark with 13K photos from over 120 cameras, accompanied by a hybrid model that combines a denoising autoencoder with a convolutional network operating on the Hadamard product of reference and query PRNU signals.
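A minimal sketch of the verification head described above: the Hadamard product of the reference and query PRNU estimates is the sole input to a small convolutional classifier, in contrast to a contrastive comparison of embeddings. The layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HadamardVerifier(nn.Module):
    """Scores whether a query PRNU estimate matches a reference one by classifying
    their element-wise (Hadamard) product rather than contrasting embeddings."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, prnu_ref, prnu_query):
        return self.net(prnu_ref * prnu_query)   # Hadamard product is the only input

# Toy usage on 128x128 PRNU patches.
model = HadamardVerifier()
score = model(torch.randn(4, 1, 128, 128), torch.randn(4, 1, 128, 128))
print(score.shape)  # torch.Size([4, 1])
```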

Authors:Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
Title: Visual Instruction Pretraining for Domain-Specific Foundation Models
Abstract:
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is still underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
Summary: Visual Instruction Pretraining (ViTP) integrates reasoning into perception by pretraining a ViT backbone inside a vision-language model on domain-specific visual instruction data, achieving state-of-the-art results on remote sensing and medical imaging benchmarks.

Authors:Dian Jin, Yanghao Zhou, Jinxing Zhou, Jiaqi Ma, Ruohao Guo, Dan Guo
Title: SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
Abstract:
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions but referring to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods.
Summary: SimToken integrates a multimodal large language model with the Segment Anything Model for referring audio-visual segmentation, using a special semantic token as a segmentation prompt and a target-consistent semantic alignment loss to achieve superior performance on the Ref-AVS benchmark.
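For intuition, a toy version of the target-consistent semantic alignment idea: token embeddings of different expressions that refer to the same object are pulled together with a cosine-based loss. The paper's exact loss may differ; shapes and names here are illustrative.

```python
import torch
import torch.nn.functional as F

def target_consistent_alignment_loss(tokens, object_ids):
    """tokens: (N, D) semantic-token embeddings; object_ids: (N,) id of the referred
    object. Pulls together embeddings of expressions that refer to the same object."""
    tokens = F.normalize(tokens, dim=-1)
    same = object_ids[:, None] == object_ids[None, :]
    same &= ~torch.eye(len(object_ids), dtype=torch.bool)   # drop self-pairs
    if not same.any():
        return tokens.new_zeros(())
    sim = tokens @ tokens.t()
    return (1.0 - sim[same]).mean()                          # cosine distance on positive pairs

# Toy usage: four expressions referring to two objects.
loss = target_consistent_alignment_loss(torch.randn(4, 256), torch.tensor([0, 0, 1, 1]))
print(loss.item())
```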

Authors:Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
Title: 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Abstract:
Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrate within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
Summary: 4DGCPro is a hierarchical 4D Gaussian compression framework that enables real-time mobile decoding and progressive volumetric video streaming through motion-aware grouping and entropy-optimized training, outperforming existing methods in rate-distortion performance.

Authors:Qinghua Lin, Guang-Hai Liu, Zuoyong Li, Yang Li, Yuting Jiang, Xiang Wu
Title: Multimodal Medical Image Classification via Synergistic Learning Pre-training
Abstract:
Multimodal pathological images are widely used in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve modality fusion under label scarcity, we propose a novel "pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we perform self-supervised pretraining, enhancing the baseline model's feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we use separate encoders to extract features from the original modalities and a multimodal fusion encoder for the fused modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: https://github.com/LQH89757/MICS.
Summary: This study proposes a "pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification that strengthens feature representations through synergistic learning and addresses modality fusion under label scarcity, achieving state-of-the-art performance on gastroscopy datasets.

Authors:Xingqi Wang, Yiming Cui, Xin Yao, Shijin Wang, Guoping Hu, Xiaoyu Qin
Title: ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding
Abstract:
Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at https://github.com/ymcui/ChartHal .
Summary: Large vision-language models exhibit severe hallucinations in chart understanding; on the ChartHal benchmark even advanced models such as GPT-5 and o4-mini achieve low accuracy, underscoring the need for better mitigation strategies.

Authors:Gunjan Chhablani, Xiaomeng Ye, Muhammad Zubair Irshad, Zsolt Kira
Title: EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
Abstract:
The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87-0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Project page: https://gchhablani.github.io/embodied-splat.

Authors:Buyin Deng, Lingxin Huang, Kai Luo, Fei Teng, Kailun Yang
Title: DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking
Abstract:
Visual Multi-Object Tracking (MOT) is a crucial component of robotic perception, yet existing Tracking-By-Detection (TBD) methods often rely on 2D cues, such as bounding boxes and motion modeling, which struggle under occlusions and close-proximity interactions. Trackers relying on these 2D cues are particularly unreliable in robotic environments, where dense targets and frequent occlusions are common. While depth information has the potential to alleviate these issues, most existing MOT datasets lack depth annotations, leading to its underexploited role in the domain. To unveil the potential of depth-informed trajectory refinement, we introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information. Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency. These strategies enable DepTR-MOT to output instance-level depth during inference, without requiring foundation models and without additional computational cost. By incorporating depth cues, our method enhances the robustness of the TBD paradigm, effectively resolving occlusion and close-proximity challenges. Experiments on both the QuadTrack and DanceTrack datasets demonstrate the effectiveness of our approach, achieving HOTA scores of 27.59 and 44.47, respectively. In particular, results on QuadTrack, a robotic platform MOT dataset, highlight the advantages of our method in handling occlusion and close-proximity challenges in robotic tracking. The source code will be made publicly available at https://github.com/warriordby/DepTR-MOT.
Summary: DepTR-MOT is a DETR-based tracker that injects instance-level depth information via soft depth label supervision and dense depth distillation, improving robustness to occlusion and close-proximity interactions in robotic tracking without extra inference cost.

Authors:Ranran Huang, Krystian Mikolajczyk
Title: SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Abstract:
We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training and inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. A masked attention mechanism is introduced to efficiently estimate target poses during training, while a reprojection loss enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap, and surpasses recent methods that rely on geometric supervision for relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets. Code and pretrained models will be available on our project page: https://ranrhuang.github.io/spfsplatv2/.

Authors:Jinchao Ge, Tengfei Cheng, Biao Wu, Zeyu Zhang, Shiya Huang, Judith Bishop, Gillian Shepherd, Meng Fang, Ling Chen, Yang Zhao
Title: VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
Abstract:
Analyzing cultural-heritage artifacts remains challenging for MLLMs: general models lack domain expertise, and SFT often overfits superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning for ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards targeting those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution with marked gains in compositional robustness over SFT-only baselines, validating diagnosis-guided, taxonomy-conditioned reward engineering and providing a reusable resource for future research. Code and dataset will be available at https://github.com/AIGeeksGroup/VaseVQA.
Summary: VaseVL strengthens MLLM reasoning about ancient Greek pottery through diagnosis-guided, taxonomy-conditioned reinforcement learning, achieving state-of-the-art style classification and historical attribution, and is released together with the VaseVQA benchmark.

Authors:Kabir Hamzah Muhammad, Marawan Elbatel, Yi Qin, Xiaomeng Li
Title: Echo-Path: Pathology-Conditioned Echo Video Generation
Abstract:
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosis of both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework to produce echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data, and augmenting real training sets with our synthetic data improves downstream diagnosis of ASD and PAH by 7% and 8%, respectively. Code, weights and dataset are available at https://github.com/Marshall-mk/EchoPathv1
Summary: Echo-Path generates realistic echocardiogram videos conditioned on specific cardiac pathologies to address data scarcity, and augmenting real training sets with its synthetic data improves automated diagnosis of ASD and PAH by 7% and 8%, respectively.

Authors:Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
Title: FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Abstract:
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/

Authors:Yuhao Tian, Zheming Yang
Title: SAEC: Scene-Aware Enhanced Edge-Cloud Collaborative Industrial Vision Inspection with Multimodal LLM
Abstract:
Industrial vision inspection requires high accuracy under stringent resource constraints, yet existing approaches face a fundamental trade-off. Multimodal LLMs (MLLMs) deliver strong reasoning capabilities but incur prohibitive computational costs, while lightweight edge models often fail on complex cases. In this paper, we present SAEC, a scene-aware enhanced edge-cloud collaborative industrial vision inspection framework with MLLM. The framework is composed of three synergistic components: (1) Efficient MLLM Fine-Tuning for Complex Defect Inspection, (2) Lightweight Multiscale Scene-Complexity Estimation, and (3) Adaptive Edge-Cloud Scheduler. Together, these modules enable robust defect detection by tailoring multimodal reasoning to scene complexity and dynamically balancing computation between edge and cloud resources. Experimental results on MVTec AD and KSDD2 datasets demonstrate that SAEC attains 85.11% and 82.72% accuracy, surpassing Qwen by 22.1% and 20.8%, and LLaVA by 33.3% and 31.6%. It also reduces runtime by up to 22.4% and cuts energy per correct decision by 40%-74%. The code is available at https://github.com/YuHao-Tian/SAEC.
中文:SAEC是一种创新的边云协同框架,通过基于场景复杂度动态分配任务,显著提升了工业视觉检测的准确性和效率,超越了现有模型。
English: SAEC is a novel edge-cloud collaborative framework that enhances industrial vision inspection by dynamically allocating tasks based on scene complexity, achieving higher accuracy and efficiency than existing models.
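The scheduling idea lends itself to a compact sketch: estimate scene complexity cheaply, then route. The gradient-energy estimator and threshold below are stand-in assumptions; the paper uses a learned lightweight multiscale estimator.

```python
import numpy as np

def scene_complexity(img: np.ndarray) -> float:
    """Crude two-scale proxy for scene complexity: mean gradient energy,
    squashed to [0, 1]. Only illustrates the interface of the paper's
    learned estimator."""
    def grad_energy(a):
        gy, gx = np.gradient(a.astype(np.float32))
        return float(np.mean(np.hypot(gx, gy)))
    e = 0.5 * (grad_energy(img) + grad_energy(img[::2, ::2]))
    return float(1.0 - np.exp(-e / 32.0))

def route(image, edge_fn, cloud_fn, threshold=0.5):
    """Send low-complexity scenes to the edge model, hard ones to the cloud MLLM."""
    score = scene_complexity(image)
    return edge_fn(image) if score < threshold else cloud_fn(image)

# Toy usage with stub predictors standing in for the real models.
img = (np.random.rand(128, 128) * 255).astype(np.float32)
print(route(img, lambda x: "edge", lambda x: "cloud"))
```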

Authors:Lingzhao Kong, Jiacheng Lin, Siyu Li, Kai Luo, Zhiyong Li, Kailun Yang
Title: CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception
Abstract:
Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird's Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for Camera-based BEV segmentation by +1.5% on OPV2V and the AP@50 for LiDAR-based 3D object detection by +3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made publicly available at https://github.com/godk0509/CoBEVMoE.
Chinese: 提出的CoBEVMoE框架通过动态专家混合架构,在协同感知中同时建模智能体间的特征相似性与异质性,在基准数据集上实现了最优性能。
English: The proposed CoBEVMoE framework enhances collaborative perception by dynamically modeling both feature similarities and heterogeneities across agents through a Dynamic Mixture-of-Experts architecture, achieving state-of-the-art performance on benchmark datasets.
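A toy rendering of the input-conditioned expert idea: a small hypernetwork produces per-agent expert weights from pooled BEV features, so each expert is dynamically generated from its agent's input. Shapes and the expert form are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class DynamicExpert(nn.Module):
    """Toy dynamic expert: a hypernetwork maps each agent's pooled BEV
    feature to that agent's own expert weights (the core DMoE idea;
    the paper's expert form is richer)."""
    def __init__(self, dim=64):
        super().__init__()
        self.hyper = nn.Linear(dim, dim * dim)
        self.dim = dim

    def forward(self, feat):                     # feat: (B, N_agents, dim)
        w = self.hyper(feat).view(*feat.shape[:-1], self.dim, self.dim)
        return torch.einsum("bnd,bnde->bne", feat, w)

expert = DynamicExpert()
print(expert(torch.randn(2, 3, 64)).shape)  # torch.Size([2, 3, 64])
```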

Authors:Yuzhu Li, An Sui, Fuping Wu, Xiahai Zhuang
Title: Uncertainty-Supervised Interpretable and Robust Evidential Segmentation
Abstract:
Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles describing the relationships between uncertainty and image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via https://github.com/suiannaius/SURE.
Chinese: 本研究提出了一种自监督的医学图像分割不确定性估计方法,通过新的监督损失函数提升了解释性和鲁棒性,在分布外场景中取得了优异结果和竞争力表现。
English: This study introduces a self-supervised method for uncertainty estimation in medical image segmentation, using novel supervision losses to enhance interpretability and robustness, achieving competitive performance and superior results in out-of-distribution scenarios.

Authors:Jie Chen, Yuhong Feng, Tao Dai, Mingzhe Liu, Hongtao Chen, Zhaoxi He, Jiancong Bai
Title: SFN-YOLO: Towards Free-Range Poultry Detection via Scale-aware Fusion Networks
Abstract:
Detecting and localizing poultry is essential for advancing smart poultry farming. Despite the progress of detection-centric methods, challenges persist in free-range settings due to multiscale targets, obstructions, and complex or dynamic backgrounds. To tackle these challenges, we introduce an innovative poultry detection approach named SFN-YOLO that utilizes scale-aware fusion. This approach combines detailed local features with broader global context to improve detection in intricate environments. Furthermore, we have developed a new expansive dataset (M-SCOPE) tailored for varied free-range conditions. Comprehensive experiments demonstrate that our model achieves an mAP of 80.7% with just 7.2M parameters, 35.1% fewer than the baseline, while retaining strong generalization capability across different domains. The efficient and real-time detection capabilities of SFN-YOLO support automated smart poultry farming. The code and dataset can be accessed at https://github.com/chenjessiee/SFN-YOLO.
中文摘要:SFN-YOLO通过尺度感知融合方法,在复杂散养环境中结合局部特征与全局上下文来提升家禽检测效果,以更少参数实现80.7%的mAP准确率,为智慧养殖提供自动化支持。
English Summary: SFN-YOLO introduces a scale-aware fusion approach that enhances poultry detection in complex free-range environments by integrating local features with global context, achieving 80.7% mAP with reduced parameters while supporting automated smart farming.

Authors:Binhua Huang, Ni Wang, Arjun Pakrashi, Soumyabrata Dev
Title: MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
Abstract:
Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
Chinese: MoCLIP-Lite是一种高效的双流视频识别框架,通过结合冻结的CLIP图像编码器和轻量级运动矢量网络,在UCF101数据集上实现了89.2%的准确率,且仅需极少的训练成本。
English: MoCLIP-Lite is an efficient two-stream video recognition framework that combines a frozen CLIP image encoder with a lightweight motion vector network, achieving 89.2% accuracy on UCF101 while requiring minimal training.
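The late-fusion design reduces to a small trainable head over frozen features, roughly as below; feature dimensions and the hidden width are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Tiny trainable MLP over concatenated frozen features, following the
    two-stream late-fusion design described for MoCLIP-Lite. Dimensions are
    illustrative assumptions."""
    def __init__(self, clip_dim=512, mv_dim=256, num_classes=101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + mv_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, clip_feat, mv_feat):
        return self.mlp(torch.cat([clip_feat, mv_feat], dim=-1))

# Both backbones stay frozen; only this head receives gradients.
head = LateFusionHead()
logits = head(torch.randn(8, 512), torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 101])
```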

Authors:Zipeng Wang, Dan Xu
Title: HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
Abstract:
Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful alternative to NeRF-based approaches, enabling real-time, high-quality novel view synthesis through explicit, optimizable 3D Gaussians. However, 3DGS suffers from significant memory overhead due to its reliance on per-Gaussian parameters to model view-dependent effects and anisotropic shapes. While recent works propose compressing 3DGS with neural fields, these methods struggle to capture high-frequency spatial variations in Gaussian properties, leading to degraded reconstruction of fine details. We present Hybrid Radiance Fields (HyRF), a novel scene representation that combines the strengths of explicit Gaussians and neural fields. HyRF decomposes the scene into (1) a compact set of explicit Gaussians storing only critical high-frequency parameters and (2) grid-based neural fields that predict remaining properties. To enhance representational capacity, we introduce a decoupled neural field architecture, separately modeling geometry (scale, opacity, rotation) and view-dependent color. Additionally, we propose a hybrid rendering scheme that composites Gaussian splatting with a neural field-predicted background, addressing limitations in distant scene representation. Experiments demonstrate that HyRF achieves state-of-the-art rendering quality while reducing model size by over 20 times compared to 3DGS and maintaining real-time performance. Our project page is available at https://wzpscott.github.io/hyrf/.

Authors:Yuhong Feng, Hongtao Chen, Qi Zhang, Jie Chen, Zhaoxi He, Mingzhe Liu, Jianghai Liao
Title: A Dual-Modulation Framework for RGB-T Crowd Counting via Spatially Modulated Attention and Adaptive Fusion
Abstract:
Accurate RGB-Thermal (RGB-T) crowd counting is crucial for public safety in challenging conditions. While recent Transformer-based methods excel at capturing global context, their inherent lack of spatial inductive bias causes attention to spread to irrelevant background regions, compromising crowd localization precision. Furthermore, effectively bridging the gap between these distinct modalities remains a major hurdle. To tackle this, we propose the Dual Modulation Framework, comprising two modules: Spatially Modulated Attention (SMA), which improves crowd localization by using a learnable Spatial Decay Mask to penalize attention between distant tokens and prevent focus from spreading to the background; and Adaptive Fusion Modulation (AFM), which implements a dynamic gating mechanism to prioritize the most reliable modality for adaptive cross-modal fusion. Extensive experiments on RGB-T crowd counting datasets demonstrate the superior performance of our method compared to previous works. Code available at https://github.com/Cht2924/RGBT-Crowd-Counting.
中文: 提出的双重调制框架通过空间调制注意力提升人群定位精度,并采用自适应融合调制实现动态跨模态整合,在RGB-T人群计数数据集上取得了领先的性能表现。
English: The proposed Dual Modulation Framework enhances RGB-Thermal crowd counting by introducing Spatially Modulated Attention to improve localization precision and Adaptive Fusion Modulation for dynamic cross-modal integration, achieving superior performance on benchmark datasets.
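A minimal single-head version of attention with a learnable spatial decay penalty, in the spirit of SMA; the penalty form and parameterization are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatiallyModulatedAttention(nn.Module):
    """Self-attention where a learnable decay on pairwise token distance
    penalizes attention between distant locations, discouraging focus from
    spreading to the background."""
    def __init__(self, dim, h, w):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()
        self.register_buffer("dist", torch.cdist(coords, coords))  # (N, N)
        self.decay = nn.Parameter(torch.tensor(0.1))  # learnable decay rate

    def forward(self, x):                 # x: (B, N, C) with N = h * w
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn - torch.relu(self.decay) * self.dist  # distant pairs penalized
        return attn.softmax(dim=-1) @ v

sma = SpatiallyModulatedAttention(dim=64, h=8, w=8)
print(sma(torch.randn(2, 64, 64)).shape)  # torch.Size([2, 64, 64])
```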

Authors:Kihyun Kim, Michalis Lazarou, Tania Stathaki
Title: Enhanced Detection of Tiny Objects in Aerial Images
Abstract:
While one-stage detectors like YOLOv8 offer fast training speed, they often under-perform on detecting small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce three enhancement strategies (input image resolution adjustment, data augmentation, and attention mechanisms) that can be easily implemented on YOLOv8. We demonstrate that enlarging the input image size and applying augmentation appropriately can improve detection accuracy. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of attention-augmented CNNs. Two well-known attention modules, the Squeeze-and-Excitation Block (SE Block) and the Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 with an increased number of channels, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our code is available at: https://github.com/Kihyun11/MoonNet
中文: 本文针对YOLOv8在航拍图像小目标检测中的不足,提出三种改进策略并设计MoonNet注意力增强网络,在微小目标检测基准上取得了最先进的性能表现。
English: This paper addresses YOLOv8's limitations in detecting small objects in aerial imagery by proposing three enhancement strategies and introducing MoonNet, an attention-augmented CNN backbone that achieves state-of-the-art performance on tiny-object detection benchmarks.
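For reference, the SE Block itself is standard and compact; a minimal PyTorch version follows (channel widths and its placement within the YOLOv8 backbone are left out).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation block, one of the two attention modules
    MoonNet integrates into the YOLOv8 backbone."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excite: channel-wise rescale

se = SEBlock(256)
print(se(torch.randn(1, 256, 40, 40)).shape)  # torch.Size([1, 256, 40, 40])
```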

Authors:Yao Du, Jiarong Guo, Xiaomeng Li
Title: CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner
Abstract:
Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.
中文:提出的CardiacCLIP框架通过注意力机制筛选关键帧并结合多尺度特征提取,显著提升了超声心动图中左心室射血分数的预测精度,在EchoNet-Dynamic数据集上平均绝对误差降低2.07。
English: The proposed CardiacCLIP framework enhances LVEF prediction in echocardiography by integrating attention-based frame selection and multi-scale feature extraction, achieving significantly improved accuracy with a 2.07 MAE reduction on EchoNet-Dynamic.
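Attention-based frame aggregation can be sketched as a learned query scoring per-frame features; this is a simplified stand-in for MFL, with illustrative dimensions.

```python
import torch
import torch.nn as nn

class AttentionFramePooling(nn.Module):
    """Minimal attention-based frame aggregation in the spirit of MFL:
    a learned query scores per-frame features and fuses them into a single
    clip-level embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def forward(self, frame_feats):            # (B, T, D)
        scores = frame_feats @ self.query      # (B, T)
        weights = scores.softmax(dim=1).unsqueeze(-1)
        return (weights * frame_feats).sum(dim=1)  # (B, D)

pool = AttentionFramePooling()
video_emb = pool(torch.randn(4, 32, 512))
print(video_emb.shape)  # torch.Size([4, 512])
```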

Authors:Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, Sicong Leng
Title: From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
Abstract:
Multi-image interleaved reasoning aims to improve the ability of Multi-modal Large Language Models (MLLMs) to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs' ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models' reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs' capability to handle complex inter-modal tasks. Our code and dataset are available at https://github.com/Shelly-coder239/MIRBench.
中文: MIR基准通过要求模型结合交错文本联合分析多张图像,并采用渐进式课程学习策略,显著提升了多模态大语言模型处理复杂跨模态任务的推理能力。
English: The MIR benchmark advances multi-modal reasoning by requiring models to jointly analyze multiple images with interleaved texts, using a progressive curriculum strategy that significantly improves performance on complex cross-modal tasks.
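The stage-wise easy-to-hard schedule can be sketched as a progressively widening sample pool; the difficulty scoring and number of stages below are assumptions, not the paper's derivation.

```python
def curriculum_stages(samples, difficulty, num_stages=3):
    """Stage-wise easy-to-hard schedule: sort by a difficulty score and
    progressively widen the training pool, so later stages also include
    harder samples."""
    ranked = sorted(samples, key=difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = int(len(ranked) * stage / num_stages)
        yield ranked[:cutoff]

# Toy usage: integers stand in for samples, identity as difficulty score.
for pool in curriculum_stages(list(range(9)), difficulty=lambda s: s):
    print(pool)
```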

Authors:Wenxuan Fang, Jili Fan, Chao Wang, Xiantao Hu, Jiangwei Weng, Ying Tai, Jian Yang, Jun Li
Title: When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration
Abstract:
Adverse Weather Image Restoration (AWIR) is a highly challenging task due to the unpredictable and dynamic nature of weather-related degradations. Traditional task-specific methods often fail to generalize to unseen or complex degradation types, while recent prompt-learning approaches depend heavily on the degradation estimation capabilities of vision-language models, resulting in inconsistent restorations. In this paper, we propose LCDiff, a novel framework comprising two key components: the Lumina-Chroma Decomposition Network (LCDN) and the Lumina-Guided Diffusion Model (LGDM). LCDN processes degraded images in the YCbCr color space, separately handling degradation-related luminance and degradation-invariant chrominance components. This decomposition effectively mitigates weather-induced degradation while preserving color fidelity. To further enhance restoration quality, LGDM leverages degradation-related luminance information as a guiding condition, eliminating the need for explicit degradation prompts. Additionally, LGDM incorporates a Dynamic Time Step Loss to optimize the denoising network, ensuring a balanced recovery of both low- and high-frequency features in the image. Finally, we present DriveWeather, a comprehensive all-weather driving dataset designed to enable robust evaluation. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods, setting a new benchmark in AWIR. The dataset and code are available at: https://github.com/fiwy0527/LCDiff.
中文: 提出的LCDiff框架通过在YCbCr色彩空间分解亮度和色度分量,并采用亮度引导的扩散模型与动态时间步长优化,有效恢复恶劣天气下的图像退化,在新型DriveWeather数据集上的实验表明其性能超越现有最佳方法。
English: The proposed LCDiff framework effectively restores weather-degraded images by decomposing luminance and chrominance components in YCbCr space and using luminance-guided diffusion with dynamic time step optimization, outperforming existing methods as validated on the new DriveWeather dataset.
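The luminance-chrominance split is a standard color-space transform; a BT.601 version is shown below, with the branch assignment reflecting the split the paper describes.

```python
import torch

def rgb_to_ycbcr(x: torch.Tensor) -> torch.Tensor:
    """BT.601 full-range RGB -> YCbCr for tensors in [0, 1], shape (B, 3, H, W).
    In LCDiff's decomposition, the Y (luminance) channel carries most weather
    degradation, while Cb/Cr keep color largely intact."""
    r, g, b = x[:, 0:1], x[:, 1:2], x[:, 2:3]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return torch.cat([y, cb, cr], dim=1)

img = torch.rand(1, 3, 64, 64)
ycbcr = rgb_to_ycbcr(img)
luma, chroma = ycbcr[:, :1], ycbcr[:, 1:]   # handled by separate branches
print(luma.shape, chroma.shape)
```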

Authors:Feng Han, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Title: VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation
Abstract:
Recently, autoregressive image generation models have wowed audiences with their remarkable capability in creating surprisingly realistic images. Models such as GPT-4o and LlamaGen can not only produce images that faithfully mimic renowned artistic styles like Ghibli, Van Gogh, or Picasso, but also potentially generate Not-Safe-For-Work (NSFW) content, raising significant concerns regarding copyright infringement and ethical use. Despite these concerns, methods to safeguard autoregressive text-to-image models remain underexplored. Previous concept erasure methods, primarily designed for diffusion models that operate in denoising latent space, are not directly applicable to autoregressive models that generate images token by token. To address this critical gap, we propose Visual Contrast Exploitation (VCE), a novel framework comprising: (1) an innovative contrastive image pair construction paradigm that precisely decouples unsafe concepts from their associated content semantics, and (2) a sophisticated DPO-based training approach that enhances the model's ability to identify and leverage visual contrastive features from image pairs, enabling precise concept erasure. Our comprehensive experiments across three challenging tasks (artist style erasure, explicit content erasure, and object removal) demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts. The code and models are available at https://github.com/Maplebb/VCE.
中文: 近期自回归图像生成模型虽能创作逼真图像,却引发版权与不良内容担忧;为此提出视觉对比利用(VCE)框架,通过对比图像对和基于DPO的训练,在消除危险概念的同时完美保留安全内容,实现精准防护。
English: Recent autoregressive image generation models like GPT-4o and LlamaGen produce highly realistic images but raise concerns about copyright and NSFW content, prompting the development of Visual Contrast Exploitation (VCE), a novel framework that effectively erases unsafe concepts while preserving safe ones through contrastive image pairs and DPO-based training.
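VCE's training builds on a DPO-style objective over contrastive image pairs; the standard DPO loss is sketched below for orientation. The pair-construction paradigm, which is the paper's key contribution, is not shown, and beta is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on preferred (w) vs. rejected (l) sequence
    log-probabilities under the trained and frozen reference models."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-6.0]),
               torch.tensor([-5.2]), torch.tensor([-5.8])).item())
```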

Authors:Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji
Title: The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
Abstract:
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a J&F of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/magic-research/Sa2VA.
中文:提出的SaSaSa2VA模型通过解决稀疏帧采样和单一标记限制来增强视频对象分割,借助分割增强和测试时集成方法,在RVOS挑战赛中取得了最佳性能。
English: The proposed SaSaSa2VA model enhances video object segmentation by addressing sparse frame sampling and single-token limitations, achieving top performance in the RVOS challenge through segmentation augmentation and test-time ensembling.

Authors:Leiyu Wang, Biao Jin, Feng Huang, Liqiong Chen, Zhengyong Wang, Xiaohai He, Honggang Chen
Title: MO R-CNN: Multispectral Oriented R-CNN for Object Detection in Remote Sensing Image
Abstract:
Oriented object detection for multi-spectral imagery faces significant challenges due to differences both within and between modalities. Although existing methods have improved detection accuracy through complex network architectures, their high computational complexity and memory consumption severely restrict their performance. Motivated by the success of large kernel convolutions in remote sensing, we propose MO R-CNN, a lightweight framework for multi-spectral oriented detection featuring a heterogeneous feature extraction network (HFEN), single modality supervision (SMS), and condition-based multimodal label fusion (CMLF). HFEN leverages inter-modal differences to adaptively align, merge, and enhance multi-modal features. SMS constrains multi-scale features and enables the model to learn from multiple modalities. CMLF fuses multimodal labels based on specific rules, providing the model with a more robust and consistent supervisory signal. Experiments on the DroneVehicle, VEDAI and OGSOD datasets demonstrate the superiority of our method. The source code is available at: https://github.com/Iwill-github/MORCNN.
中文摘要:提出的MO R-CNN框架通过异构特征提取、单模态监督和条件式标签融合等轻量化组件,有效解决多光谱定向目标检测难题,在多个基准数据集上验证了优越性能。
English Summary: The proposed MO R-CNN framework addresses multi-spectral oriented object detection challenges through lightweight components including heterogeneous feature extraction, single modality supervision, and conditional label fusion, demonstrating superior performance on benchmark datasets.

Authors:Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
Title: Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Abstract:
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a few (e.g., 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
中文: 本文提出一种自蒸馏区域提议网络(SD-RPN),通过将多模态大语言模型的中间层注意力图转化为高质量区域提议,无需大量标注数据即可显著提升细粒度感知能力,在多项基准测试中实现超过10%的准确率提升。
English: This paper introduces a Self-Distilled Region Proposal Network (SD-RPN) that efficiently enhances multimodal large language models' fine-grained perception by generating high-quality region proposals from denoised attention maps, achieving significant accuracy improvements without costly annotations or full model fine-tuning.
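The pseudo-label pipeline can be caricatured in a few lines: smooth a noisy attention map, keep the top quantile, and box the result. The real denoising and ambiguity resolution are more involved; the kernel size and quantile here are assumptions.

```python
import torch

def attention_to_pseudo_roi(attn: torch.Tensor, keep: float = 0.90):
    """Toy conversion of a noisy mid-layer attention map into a pseudo-RoI:
    smooth, threshold at a quantile, and take the bounding box of retained
    cells. attn: (H, W) attention over image patches."""
    pooled = torch.nn.functional.avg_pool2d(
        attn[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    mask = pooled >= pooled.flatten().quantile(keep)
    ys, xs = mask.nonzero(as_tuple=True)
    if len(ys) == 0:
        return None
    return (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())

print(attention_to_pseudo_roi(torch.rand(24, 24)))
```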

Authors:Yijun Yuan, Zhuoguang Chen, Kenan Li, Weibang Wang, Hang Zhao
Title: SLAM-Former: Putting SLAM into One Transformer
Abstract:
We present SLAM-Former, a novel neural approach that integrates full SLAM capabilities into a single transformer. Similar to traditional SLAM systems, SLAM-Former comprises both a frontend and a backend that operate in tandem. The frontend processes sequential monocular images in real-time for incremental mapping and tracking, while the backend performs global refinement to ensure a geometrically consistent result. This alternating execution allows the frontend and backend to mutually promote one another, enhancing overall system performance. Comprehensive experimental results demonstrate that SLAM-Former achieves superior or highly competitive performance compared to state-of-the-art dense SLAM methods.

Authors:Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu, Georges El Fakhri, Xiaofeng Liu, Shijian Lu
Title: Rethinking Evaluation of Infrared Small Target Detection
Abstract:
As an essential vision task, infrared small target detection (IRSTD) has seen significant advancements through deep learning. However, critical limitations in current evaluation protocols impede further progress. First, existing methods rely on fragmented, level-specific pixel- and target-level metrics, which fail to provide a comprehensive view of model capabilities. Second, an excessive emphasis on overall performance scores obscures crucial error analysis, which is vital for identifying failure modes and improving real-world system performance. Third, the field predominantly adopts dataset-specific training-testing paradigms, hindering the understanding of model robustness and generalization across diverse infrared scenarios. This paper addresses these issues by introducing a hybrid-level metric incorporating pixel- and target-level performance, proposing a systematic error analysis method, and emphasizing the importance of cross-dataset evaluation. Together, these aim to offer a more thorough and rational hierarchical analysis framework, ultimately fostering the development of more effective and robust IRSTD models. An open-source toolkit has been released to facilitate standardized benchmarking.
中文摘要:本文针对红外小目标检测评估方法的三大局限,提出了融合像素与目标级指标的混合度量体系、系统化误差分析方法及跨数据集评估方案,旨在建立更全面的分层分析框架以促进模型发展。
English Summary: This paper identifies key limitations in current infrared small target detection evaluation methods and proposes a comprehensive framework integrating hybrid-level metrics, systematic error analysis, and cross-dataset evaluation to advance model development.
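One way to read "hybrid-level metric" is a blend of pixel overlap and target-level detection; the sketch below is such a blend under assumptions (equal weighting, any-pixel hit criterion) and may differ from the paper's formulation.

```python
import numpy as np
from scipy import ndimage

def hybrid_score(pred: np.ndarray, gt: np.ndarray, alpha: float = 0.5):
    """Blend pixel-level IoU with a target-level detection rate: a GT
    component counts as detected if any predicted pixel falls inside it."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    pixel_iou = inter / union if union else 1.0

    labels, n_targets = ndimage.label(gt)
    hit = sum(pred[labels == i].any() for i in range(1, n_targets + 1))
    target_rate = hit / n_targets if n_targets else 1.0
    return alpha * pixel_iou + (1 - alpha) * target_rate

pred = np.zeros((64, 64), bool); pred[10:13, 10:13] = True
gt = np.zeros((64, 64), bool); gt[11:14, 11:14] = True
print(hybrid_score(pred, gt))
```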

Authors:Md. Atabuzzaman, Ali Asgarov, Chris Thomas
Title: Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have achieved strong performance on vision-language tasks, particularly Visual Question Answering (VQA). While prior work has explored unimodal biases in VQA, the problem of selection bias in Multiple-Choice Question Answering (MCQA), where models may favor specific option tokens (e.g., "A") or positions, remains underexplored. In this paper, we investigate both the presence and nature of selection bias in LVLMs through fine-grained MCQA benchmarks spanning easy, medium, and hard difficulty levels, defined by the semantic similarity of the options. We further propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts and applies confidence-adaptive corrections to the model's output. Our method mitigates bias without retraining and is compatible with frozen LVLMs. Extensive experiments across several state-of-the-art models reveal consistent selection biases that intensify with task difficulty, and show that our mitigation approach significantly reduces bias while improving accuracy in challenging settings. This work offers new insights into the limitations of LVLMs in MCQA and presents a practical approach to improve their robustness in fine-grained visual reasoning. Datasets and code are available at: https://github.com/Atabuzzaman/Selection-Bias-of-LVLMs
中文: 大型视觉语言模型在多选题问答中存在随任务难度加剧的选择偏差,我们提出的对数级去偏方法无需重新训练即可有效缓解偏差并提升准确率。
English: Large Vision-Language Models exhibit consistent selection bias in Multiple-Choice Question Answering that escalates with task difficulty, and our proposed logit-level debiasing method effectively mitigates this bias without retraining while enhancing accuracy.
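The inference-time correction can be sketched as a confidence-scaled subtraction of an estimated option-bias vector from the logits; both the bias estimation noted in the comment and the adaptation rule are simplified assumptions.

```python
import torch

def debias_logits(option_logits: torch.Tensor,
                  bias_vector: torch.Tensor) -> torch.Tensor:
    """Subtract an estimated per-option bias from the logits, scaled down
    when the model is already confident in its answer."""
    probs = option_logits.softmax(dim=-1)
    confidence = probs.max(dim=-1, keepdim=True).values   # in [1/K, 1]
    strength = 1.0 - confidence                           # trust confident answers
    return option_logits - strength * bias_vector

# bias_vector would be estimated from content-free general/contextual prompts.
logits = torch.tensor([[2.0, 0.5, 0.3, 0.1]])
bias = torch.tensor([1.5, 0.0, 0.0, 0.0])   # model systematically favors "A"
print(debias_logits(logits, bias))
```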

Authors:Kai Jiang, Zhengyan Shi, Dell Zhang, Hongyuan Zhang, Xuelong Li
Title: Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning
Abstract:
Class Incremental Learning (CIL) aims to continuously learn new categories while retaining the knowledge of old ones. Pre-trained models (PTMs) show promising capabilities in CIL. However, existing approaches that apply lightweight fine-tuning to backbones still induce parameter drift, thereby compromising the generalization capability of pre-trained models. Parameter drift can be conceptualized as a form of noise that obscures critical patterns learned for previous tasks. However, recent research has shown that noise is not always harmful. For example, the large number of visual patterns learned from pre-training can be easily abused by a single task, and introducing appropriate noise can suppress some low-correlation features, thus leaving a margin for future tasks. To this end, guided by information theory, we propose Mixture of Noise (Min), which learns beneficial noise for CIL to mitigate the degradation of backbone generalization when adapting to new tasks. Specifically, task-specific noise is learned from high-dimension features of new tasks. Then, a set of weights is adjusted dynamically for an optimal mixture of different task noise. Finally, Min embeds the beneficial noise into the intermediate features to mask the response of inefficient patterns. Extensive experiments on six benchmark datasets demonstrate that Min achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-step incremental settings. This shows the significant potential of beneficial noise in continual learning. Code is available at https://github.com/ASCIIJK/MiN-NeurIPS2025.
Chinese: 提出的噪声混合方法依据信息理论从新任务中学习有益噪声,通过动态混合并嵌入特征来缓解参数漂移,从而在类增量学习中保持预训练模型的泛化能力,在多个基准测试中取得了领先的性能。
English: The proposed Mixture of Noise (Min) method leverages information theory to learn beneficial noise from new tasks, which is dynamically mixed and embedded into features to mitigate parameter drift and preserve the generalization of pre-trained models in class incremental learning, achieving state-of-the-art performance across multiple benchmarks.

Authors:Pan Liu, Jinshi Liu
Title: When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
Abstract:
While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with the network's tendency toward overconfidence, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, directly discarding low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal, Cityscapes, and COCO benchmarks show that CSL performs favorably against state-of-the-art methods. Code and model weights are available at https://github.com/PanLiuCSU/CSL.
Chinese: 本文提出置信度可分离学习(CSL)方法,通过将伪标签选择构建为置信度分布特征空间中的凸优化问题来建立样本特定决策边界,并采用随机掩码从低可靠性区域学习上下文关系,在多个基准测试中展现出优于现有方法的性能。
English: This paper introduces Confidence Separable Learning (CSL), a novel method that formulates pseudo-label selection as a convex optimization problem to establish sample-specific decision boundaries and employs random masking to learn from low-reliability regions, demonstrating superior performance on major benchmarks compared to existing approaches.

Authors:Suorong Yang, Hongchao Yang, Suhan Guo, Furao Shen, Jian Zhao
Title: IPF-RDA: An Information-Preserving Framework for Robust Data Augmentation
Abstract:
Data augmentation is widely utilized as an effective technique to enhance the generalization performance of deep models. However, data augmentation may inevitably introduce distribution shifts and noise, which significantly constrain the potential and deteriorate the performance of deep networks. To this end, in this paper we propose a novel information-preserving framework, namely IPF-RDA, to enhance the robustness of data augmentation. IPF-RDA combines (i) a new class-discriminative information estimation algorithm that identifies the points most vulnerable to data augmentation operations together with their corresponding importance scores; and (ii) a new information-preserving scheme that adaptively preserves the critical information in augmented samples while ensuring the diversity of augmented data. We divide data augmentation methods into three categories according to their operation types and integrate these approaches into our framework accordingly. After being integrated into our framework, the robustness of data augmentation methods can be enhanced and their full potential can be unleashed. Extensive experiments demonstrate that, although simple, IPF-RDA consistently improves the performance of numerous commonly used state-of-the-art data augmentation methods with popular deep models on a variety of datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, CUHK03, Market1501, Oxford Flower, and MNIST, highlighting its performance and scalability. The implementation is available at https://github.com/Jackbrocp/IPF-RDA.
中文摘要:本文提出IPF-RDA框架,通过识别易受数据增强影响的关键点并保留重要信息,有效提升多种数据增强方法的鲁棒性,在多个数据集上显著改善深度模型的性能表现。
English Summary: The paper introduces IPF-RDA, a novel framework that enhances data augmentation robustness by identifying vulnerable data points and preserving critical information, consistently improving performance across various datasets and models.

Authors:Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang
Title: Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
Abstract:
Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
中文摘要:本研究首次针对标签噪声下的动作视频对象分割建立基准,通过调整学习策略和引入并行掩码头机制,有效应对文本和掩码标注噪声,提升了模型的鲁棒性。
English Summary: This study introduces the first benchmark for action-based video object segmentation under label noise, addressing both textual and mask annotation noise through adapted learning strategies and a novel Parallel Mask Head Mechanism to enhance robustness.
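The two noise sources translate into simple corruption routines; the sketch below (category flips, random dilation/erosion) is a stand-in for the benchmark's actual perturbation scheme, with probabilities and iteration counts chosen arbitrarily.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def flip_category(label: str, vocab: list[str], p: float = 0.2) -> str:
    """Textual prompt noise: with probability p, replace the category noun."""
    if rng.random() < p:
        return str(rng.choice([c for c in vocab if c != label]))
    return label

def perturb_boundary(mask: np.ndarray, iters: int = 2) -> np.ndarray:
    """Mask annotation noise: randomly dilate or erode the mask to mimic
    imprecise boundary supervision."""
    op = ndimage.binary_dilation if rng.random() < 0.5 else ndimage.binary_erosion
    return op(mask, iterations=iters)

mask = np.zeros((32, 32), bool); mask[8:20, 8:20] = True
print(flip_category("cup", ["cup", "mug", "bowl"]), perturb_boundary(mask).sum())
```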

Authors:Weiran Chen, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu
Title: DA-Font: Few-Shot Font Generation via Dual-Attention Hybrid Integration
Abstract:
Few-shot font generation aims to create new fonts with a limited number of glyph references. It can be used to significantly reduce the labor cost of manual font design. However, due to the variety and complexity of font styles, the results generated by existing methods often suffer from visible defects, such as stroke errors, artifacts and blurriness. To address these issues, we propose DA-Font, a novel framework which integrates a Dual-Attention Hybrid Module (DAHM). Specifically, we introduce two synergistic attention blocks: the component attention block that leverages component information from content images to guide the style transfer process, and the relation attention block that further refines spatial relationships through interacting the content feature with both original and stylized component-wise representations. These two blocks collaborate to preserve accurate character shapes and stylistic textures. Moreover, we also design a corner consistency loss and an elastic mesh feature loss to better improve geometric alignment. Extensive experiments show that our DA-Font outperforms the state-of-the-art methods across diverse font styles and characters, demonstrating its effectiveness in enhancing structural integrity and local fidelity. The source code can be found at https://github.com/wrchen2001/DA-Font.
Chinese: DA-Font通过双注意力机制和优化的损失函数,有效解决了少样本字体生成中的笔画错误和结构失真问题,显著提升了字体生成的视觉质量。
English: DA-Font introduces a dual-attention framework and specialized loss functions to overcome defects in few-shot font generation, achieving superior results in style and structural accuracy.

Authors:Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Zhifeng Li, Wei Liu, Linfeng Zhang, Qifeng Chen
Title: Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation
Abstract:
We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while ensuring generation efficiency. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: expression-aware landmarks as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to efficiently generate long-term stable animation results, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation, and introduce a Taylor-interpolated cache, achieving a 2.6X lossless acceleration. These two strategies ensure that our method produces high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset and benchmark will be available at https://follow-your-emoji.github.io/.
中文: 我们提出Follow-Your-Emoji-Faster框架,通过增强运动信号和面部损失,在保持身份特征和表情精度的同时,高效生成多样化肖像动画,并实现无损加速,展现出卓越的动画质量与控制性。
English: We introduce Follow-Your-Emoji-Faster, an efficient diffusion-based framework that uses facial landmarks to animate diverse portraits while preserving identity and expression accuracy through enhanced motion signals and facial loss, achieving high-quality results with accelerated performance.

Authors:Junjie Zhou, Haijun Xiong, Junhao Lu, Ziyu Lin, Bin Feng
Title: CGTGait: Collaborative Graph and Transformer for Gait Emotion Recognition
Abstract:
Skeleton-based gait emotion recognition has received significant attention due to its wide-ranging applications. However, existing methods primarily focus on extracting spatial and local temporal motion information, failing to capture long-range temporal representations. In this paper, we propose CGTGait, a novel framework that collaboratively integrates graph convolution and transformers to extract discriminative spatiotemporal features for gait emotion recognition. Specifically, CGTGait consists of multiple CGT blocks, where each block employs graph convolution to capture frame-level spatial topology and the transformer to model global temporal dependencies. Additionally, we introduce a Bidirectional Cross-Stream Fusion (BCSF) module to effectively aggregate posture and motion spatiotemporal features, facilitating the exchange of complementary information between the two streams. We evaluate our method on two widely used datasets, Emotion-Gait and ELMD, demonstrating that our CGTGait achieves state-of-the-art or at least competitive performance while reducing computational complexity by approximately 82.2% (only requiring 0.34G FLOPs) during testing. Code is available at https://github.com/githubzjj1/CGTGait.
中文: CGTGait框架结合图卷积与Transformer提取步态情感识别的时空特征,在显著降低82.2%计算量的同时实现了最优性能。
English: The proposed CGTGait framework integrates graph convolution and transformers to capture spatiotemporal features for gait emotion recognition, achieving state-of-the-art performance while reducing computational complexity by 82.2%.
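A minimal CGT-style block, per-frame graph convolution followed by a temporal transformer layer, is sketched below; the adjacency, normalization, and layer sizes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CGTBlock(nn.Module):
    """Toy CGT block: graph convolution over joints within each frame,
    then a transformer encoder layer over time for global temporal
    dependencies."""
    def __init__(self, dim, num_joints, adj):
        super().__init__()
        deg = adj.sum(-1, keepdim=True).clamp_min(1)
        self.register_buffer("a_norm", adj / deg)   # row-normalized adjacency
        self.gcn = nn.Linear(dim, dim)
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim * num_joints, nhead=4, batch_first=True)

    def forward(self, x):                           # x: (B, T, J, C)
        x = torch.relu(self.gcn(self.a_norm @ x))   # spatial topology per frame
        b, t, j, c = x.shape
        x = self.temporal(x.reshape(b, t, j * c))   # global temporal modeling
        return x.reshape(b, t, j, c)

adj = torch.eye(16) + torch.rand(16, 16).round()    # toy skeleton adjacency
blk = CGTBlock(dim=8, num_joints=16, adj=adj)
print(blk(torch.randn(2, 30, 16, 8)).shape)  # torch.Size([2, 30, 16, 8])
```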

Authors:Shipeng Liu, Zhonglin Zhang, Dengfeng Chen, Liang Zhao
Title: Describe-to-Score: Text-Guided Efficient Image Complexity Assessment
Abstract:
Accurately assessing image complexity (IC) is critical for computer vision, yet most existing methods rely solely on visual features and often neglect high-level semantic information, limiting their accuracy and generalization. We introduce vision-text fusion for IC modeling. This approach integrates visual and textual semantic features, increasing representational diversity. It also reduces the complexity of the hypothesis space, which enhances both accuracy and generalization in complexity assessment. We propose the D2S (Describe-to-Score) framework, which generates image captions with a pre-trained vision-language model. Through feature alignment and entropy distribution alignment mechanisms, D2S guides semantic information to inform complexity assessment while bridging the gap between the vision and text modalities. D2S utilizes multi-modal information during training but requires only the vision branch during inference, thereby avoiding multi-modal computational overhead and enabling efficient assessment. Experimental results demonstrate that D2S outperforms existing methods on the IC9600 dataset and maintains competitiveness on no-reference image quality assessment (NR-IQA) benchmarks, validating the effectiveness and efficiency of multi-modal fusion in complexity-related tasks. Code is available at: https://github.com/xauat-liushipeng/D2S
中文摘要:D2S框架通过视觉-文本融合整合视觉与语义特征,在提升图像复杂度评估准确性和泛化能力的同时,保持了推理阶段的高效性。
English Summary: The D2S framework enhances image complexity assessment by integrating visual and textual semantic features through vision-text fusion, improving accuracy and generalization while maintaining computational efficiency during inference.
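The two alignment mechanisms can be caricatured as a cosine feature-alignment term plus an entropy-distribution match; both terms below are simplified assumptions about the paper's losses, with illustrative dimensions.

```python
import torch
import torch.nn.functional as F

def alignment_losses(v_feat, t_feat):
    """Sketch of D2S-style vision-text alignment: (i) cosine feature
    alignment and (ii) a soft match between the entropies of the two
    modalities' feature distributions."""
    feat_align = 1.0 - F.cosine_similarity(v_feat, t_feat, dim=-1).mean()
    p_v = v_feat.softmax(dim=-1)
    p_t = t_feat.softmax(dim=-1)
    entropy = lambda p: -(p * p.clamp_min(1e-8).log()).sum(-1)
    ent_align = (entropy(p_v) - entropy(p_t)).abs().mean()
    return feat_align, ent_align

fa, ea = alignment_losses(torch.randn(8, 256), torch.randn(8, 256))
print(fa.item(), ea.item())
```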

Authors:Minji Heo, Simon S. Woo
Title: FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection
Abstract:
Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as face-swapping, GAN-based generation, and diffusion methods, can pose an emerging and unforeseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated, single-step manipulations, little is known about detection model behavior under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce FakeChain, a large-scale benchmark comprising 1-, 2-, and 3-step forgeries synthesized using five state-of-the-art representative generators. Using this approach, we analyze detection performance and spectral properties across hybrid manipulations at different steps, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance depends heavily on the final manipulation type, with the F1-score dropping by up to 58.83% when it differs from the training distribution. This clearly demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. Such findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results highlight the importance of benchmarks such as FakeChain, reflecting the growing synthesis complexity and diversity in real-world scenarios. Our sample code is available at https://github.com/minjihh/FakeChain.
中文: 通过组合不同生成方法创建的多步骤混合深度伪造对检测模型构成重大挑战,这些模型因依赖最终阶段痕迹而非累积操作特征而难以泛化。
English: Multi-step hybrid deepfakes created by combining different generation methods present significant challenges to detection models, which often fail to generalize due to reliance on final-stage artifacts rather than cumulative manipulation traces.

Authors:Antonio Scardace, Lemuel Puglisi, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì
Title: A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis
Abstract:
Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1 scores by an average of +52.03% over the best existing method. Code and data of our approach are publicly available at the following link: https://github.com/brAIn-science/DeepSSIM.
Chinese: DeepSSIM是一种新颖的自监督度量方法,通过将图像投影至与结构相似性对齐的嵌入空间,有效量化医学影像生成模型中的记忆效应,在检测训练数据泄露方面展现出优于现有方法的性能。
English: DeepSSIM is a novel self-supervised metric that effectively quantifies memorization in medical imaging generative models by projecting images into an embedding space aligned with structural similarity, demonstrating superior performance over existing methods in detecting training data leakage.
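The core supervision is easy to state in code: regress the cosine similarity of learned embeddings onto ground-truth image-space SSIM. The sketch below omits the encoder and the structure-preserving augmentations; dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def deepssim_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  ssim_gt: torch.Tensor) -> torch.Tensor:
    """Training signal described for DeepSSIM: make the cosine similarity of
    embedding pairs match the ground-truth SSIM computed in image space."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)   # (B,)
    return F.mse_loss(cos, ssim_gt)

loss = deepssim_loss(torch.randn(16, 128), torch.randn(16, 128), torch.rand(16))
print(loss.item())
```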

Authors:Ji Soo Lee, Byungoh Ko, Jaewon Cho, Howoong Lee, Jaewoon Byun, Hyunwoo J. Kim
Title: Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization
Abstract:
In text-video retrieval, auxiliary captions are often used to enhance video understanding, bridging the gap between the modalities. While recent advances in multi-modal large language models (MLLMs) have enabled strong zero-shot caption generation, we observe that such captions tend to be generic and indistinguishable across visually similar videos, limiting their utility for fine-grained retrieval. Moreover, conventional captioning approaches are typically evaluated using language generation metrics, such as BLEU, which are not tailored for retrieval tasks that require making discriminative distinctions between candidates. To address this, we propose CaRe-DPO, a retrieval framework that directly optimizes caption generation using retrieval relevance scores. At its core is Dual-Group Direct Preference Optimization (DG-DPO), a novel learning strategy that supervises captioning by modeling preferences across groups of distinct video and caption pairs. In addition, we present an MLLM-based retrieval model that incorporates role-embeddings to better distinguish between textual inputs with different functional roles, such as an auxiliary caption and a text query. Through extensive experiments, we demonstrate that CaRe-DPO significantly enhances retrieval performance by effectively leveraging auxiliary knowledge to generate fine-grained captions for retrieval. Code is available at https://github.com/mlvlab/CaReDPO.
中文: 提出的CaRe-DPO框架通过检索相关性分数直接优化字幕生成,并采用跨视频-字幕对建模偏好的新颖学习策略,有效提升细粒度文本-视频检索性能。
English: The proposed CaRe-DPO framework enhances text-video retrieval by directly optimizing caption generation through retrieval relevance scores and a novel learning strategy that models preferences across video-caption pairs, significantly improving fine-grained retrieval performance.

Authors:Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, Jinyuan Liu
Title: Efficient Rectified Flow for Image Fusion
Abstract:
Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities into a single fused image. In recent years, diffusion models have made significant progress in the field of image fusion. However, diffusion models often require complex computation and lengthy iterative sampling, which limits the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at https://github.com/zirui0625/RFfusion.
中文: RFfusion 是一种高效的图像融合一步扩散模型,通过结合整流流和任务特定的变分自编码器,在保持高质量融合效果的同时显著提升了推理速度。
English: RFfusion is an efficient one-step diffusion model for image fusion that integrates Rectified Flow and a task-specific VAE to enhance inference speed while maintaining high-quality results.

Authors:Guangze Zheng, Shijie Lin, Haobo Zuo, Si Si, Ming-Shan Wang, Changhong Fu, Jia Pan
Title: Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity
Abstract:
This work proposes the Lattice Boltzmann Model (LBM) to learn real-world pixel dynamicity for visual tracking. LBM decomposes visual representations into dynamic pixel lattices and solves pixel motion states through collision-streaming processes. Specifically, the high-dimensional distribution of the target pixels is acquired through a multilayer predict-update network to estimate the pixel positions and visibility. The predict stage formulates lattice collisions among the spatial neighborhood of target pixels and develops lattice streaming within the temporal visual context. The update stage rectifies the pixel distributions with online visual representations. Compared with existing methods, LBM demonstrates practical applicability in an online and real-time manner, which can efficiently adapt to real-world visual tracking tasks. Comprehensive evaluations on real-world point tracking benchmarks such as TAP-Vid and RoboTAP validate LBM's efficiency. A general evaluation on large-scale open-world object tracking benchmarks such as TAO, BFT, and OVT-B further demonstrates LBM's real-world practicality.
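中文: LBM将视觉表征分解为动态像素格子,通过碰撞-流动过程和预测-更新网络求解像素的运动状态与可见性,在真实世界点跟踪与开放世界目标跟踪基准上验证了其高效性与实用性。
English: LBM decomposes visual representations into dynamic pixel lattices and solves pixel motion states via collision-streaming predict-update steps, delivering efficient online, real-time tracking validated on real-world point and object tracking benchmarks.

For intuition about the collision-streaming machinery the tracker borrows its vocabulary from, the sketch below is the textbook D2Q9 BGK lattice Boltzmann update (a fluid-dynamics illustration, not the paper's pixel tracker; the relaxation time tau and grid size are arbitrary):

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their equilibrium weights.
C = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])
W = np.array([4/9] + [1/9]*4 + [1/36]*4)

def lbm_step(f, tau=0.6):
    """One BGK collision + streaming update of distributions f with shape [9, H, W]."""
    rho = f.sum(axis=0)                                # density per cell
    u = np.tensordot(C.T, f, axes=1) / rho             # velocity field [2, H, W]
    cu = np.tensordot(C, u, axes=1)                    # per-direction projections [9, H, W]
    usq = (u ** 2).sum(axis=0)
    feq = W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    f = f - (f - feq) / tau                            # collision: relax toward equilibrium
    # Streaming: shift each direction's distribution along its lattice velocity.
    for i, (cx, cy) in enumerate(C):
        f[i] = np.roll(np.roll(f[i], cx, axis=1), cy, axis=0)
    return f

f = np.tile(W[:, None, None], (1, 16, 16))             # uniform initial state
print(lbm_step(f).shape)
```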

Authors:Burak Satar, Zhixin Ma, Patrick A. Irawan, Wilfried A. Mulyawan, Jing Jiang, Ee-Peng Lim, Chong-Wah Ngo
Title: Seeing Culture: A Benchmark for Visual Reasoning and Grounding
Abstract:
Multimodal vision-language models (VLMs) have made substantial progress in various tasks that require a combined understanding of visual and textual content, particularly in cultural understanding tasks, with the emergence of new cultural datasets. However, these datasets frequently fall short of providing cultural reasoning while underrepresenting many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), focusing on cultural reasoning with a novel approach that requires VLMs to reason on culturally rich images in two stages: i) selecting the correct visual option with multiple-choice visual question answering (VQA), and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: those originating from the same country, those from different countries, or a mixed group. Notably, all options are derived from a single category for each type. Progression to the second stage occurs only after a correct visual option is chosen. The SCB benchmark comprises 1,065 images that capture 138 cultural artifacts across five categories from seven Southeast Asian countries, whose diverse cultures are often overlooked, accompanied by 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexities involved in cross-modal cultural reasoning and highlights the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. The SCB serves as a crucial benchmark for identifying these shortcomings, thereby guiding future developments in the field of cultural reasoning. https://github.com/buraksatar/SeeingCulture
中文摘要:Seeing Culture Benchmark(SCB)通过两阶段评估方法,要求视觉语言模型先回答文化选择题再分割相关文物,利用1065张东南亚多元文化图像解决了现有数据集文化推理能力不足的问题。
English Summary: The Seeing Culture Benchmark (SCB) introduces a two-stage evaluation method requiring vision-language models to first answer cultural multiple-choice questions and then segment relevant artifacts, addressing the lack of cultural reasoning in existing datasets through 1,065 culturally diverse Southeast Asian images.

Authors:Haijin Zeng, Xuan Lu, Yurong Zhang, Yongyong Chen, Jingyong Su, Jie Liu
Title: SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging
Abstract:
Humans learn in two complementary ways: a slow, cumulative process that builds broad, general knowledge, and a fast, on-the-fly process that captures specific experiences. Existing deep-unfolding methods for spectral compressive imaging (SCI) mirror only the slow component, relying on heavy pre-training with many unfolding stages, yet they lack the rapid adaptation needed to handle new optical configurations. As a result, they falter on out-of-distribution cameras, especially in bespoke spectral setups unseen during training. This depth also incurs heavy computation and slow inference. To bridge this gap, we introduce SlowFast-SCI, a dual-speed framework seamlessly integrated into any deep unfolding network beyond SCI systems. During slow learning, we pre-train or reuse a priors-based backbone and distill it via imaging guidance into a compact fast-unfolding model. In the fast learning stage, lightweight adaptation modules are embedded within each block and trained in a self-supervised manner at test time via a dual-domain loss, without retraining the backbone. To the best of our knowledge, SlowFast-SCI is the first test-time adaptation-driven deep unfolding framework for efficient, self-adaptive spectral reconstruction. Its dual-stage design unites offline robustness with on-the-fly per-sample calibration, yielding over 70% reduction in parameters and FLOPs, up to 5.79 dB PSNR improvement on out-of-distribution data, preserved cross-domain adaptability, and a 4x faster adaptation speed. In addition, its modularity integrates with any deep-unfolding network, paving the way for self-adaptive, field-deployable imaging and expanded computational imaging modalities. Code and models are available at https://github.com/XuanLu11/SlowFast-SCI.
中文:SlowFast-SCI提出了一种双速框架,将预训练的鲁棒性与轻量级测试时自适应相结合,在分布外光谱成像数据上实现了显著效率提升和性能改进。
English: SlowFast-SCI introduces a dual-speed framework that combines pre-trained robustness with lightweight test-time adaptation, achieving significant efficiency gains and improved performance on out-of-distribution spectral imaging data.
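
Read operationally, the fast-learning stage freezes the distilled backbone and updates only small adapter modules at test time with a self-supervised loss. A minimal sketch under that reading, with a generic measurement-consistency loss standing in for the paper's dual-domain loss; all module and function names are illustrative:

```python
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    """Frozen backbone block plus a lightweight residual adapter."""
    def __init__(self, dim=32):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.adapter = nn.Sequential(nn.Linear(dim, 4), nn.ReLU(), nn.Linear(4, dim))
        for p in self.backbone.parameters():
            p.requires_grad = False              # slow-learned knowledge stays fixed

    def forward(self, x):
        return self.backbone(x) + self.adapter(x)  # fast, per-sample correction

def test_time_adapt(model, measurement, forward_op, steps=10, lr=1e-3):
    """Adapt only the adapters by enforcing consistency with the raw measurement."""
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
    for _ in range(steps):
        recon = model(measurement)
        loss = ((forward_op(recon) - measurement) ** 2).mean()  # self-supervised
        opt.zero_grad(); loss.backward(); opt.step()
    return model(measurement).detach()

model = AdaptedBlock()
y = torch.randn(8, 32)
out = test_time_adapt(model, y, forward_op=lambda x: x)  # identity stands in for the camera model
print(out.shape)
```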

Authors:Joe Barrow
Title: CommonForms: A Large, Diverse Dataset for Form Field Detection
Abstract:
This paper introduces CommonForms, a web-scale dataset for form field detection. It casts the problem of form field detection as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of form fields. The dataset is constructed by filtering Common Crawl to find PDFs that have fillable elements. Starting with 8 million documents, the filtering process is used to arrive at a final dataset of roughly 55k documents that have over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains; one third of the pages are non-English, and among the 14 classified domains, no domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency over using all PDFs that have fillable fields in Common Crawl. A qualitative analysis shows that they outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large-scale dataset released for form field detection, as well as the first open-source models. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms
中文: 本文提出了基于网络PDF构建的大规模表单字段检测数据集CommonForms,并开发了FFDNet模型系列,以低于500美元的训练成本实现高精度检测,其性能优于商业解决方案。
English: This paper introduces CommonForms, a large-scale dataset for form field detection built from web PDFs, and presents FFDNet models that achieve high precision at low cost, outperforming commercial solutions.

Authors:Mohamed Eltahir, Osamah Sarraj, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammed Khurd, Mohammed Bremoo, Tanveer Hussain
Title: AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks
Abstract:
Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g. DiDeMo, MSR-VTT) and recent multilingual corpora (e.g. RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce a three-stage framework, AutoArabic, utilizing state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the manual revision required by nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark, produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. An analysis of the translation errors is provided and organized into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark, finding a moderate performance gap (about 3 percentage points at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero / flagged-only / full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero-budget) remains usable. To ensure reproducibility for other languages, we make the code available at https://github.com/Tahaalshatiri/AutoArabic.
中文:AutoArabic框架利用先进的大型语言模型将视频文本基准自动翻译成阿拉伯语,准确率高,创建了如DiDeMo-AR等本地化数据集,在保持基准难度的同时显著减少了人工编辑需求。
English: The AutoArabic framework uses advanced LLMs to automatically translate video-text benchmarks into Arabic with high accuracy, creating localized datasets like DiDeMo-AR that preserve benchmark difficulty while significantly reducing manual editing needs.

Authors:Zhengri Wu, Yiran Wang, Yu Wen, Zeyu Zhang, Biao Wu, Hao Tang
Title: StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes
Abstract:
Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach. Code: https://github.com/AIGeeksGroup/StereoAdapter. Website: https://aigeeksgroup.github.io/StereoAdapter.
中文: StereoAdapter是一种参数高效的自监督框架,通过结合LoRA适配的单目基础编码器和立体细化模块,在多种水下环境中提升了深度估计的鲁棒性并实现了最先进的性能。
English: StereoAdapter is a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with stereo refinement, achieving state-of-the-art underwater depth estimation performance while enhancing robustness across diverse conditions.
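
The LoRA adaptation mentioned here is the standard low-rank reparameterization: a frozen pretrained weight augmented by a trainable product of two thin matrices. A minimal sketch of one LoRA-adapted linear layer; the paper's dynamic rank selection is not shown, and rank/alpha are illustrative defaults:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # foundation weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)           # torch.Size([2, 64])
```

Because B starts at zero, the adapted layer reproduces the frozen encoder exactly at initialization, and only the small A/B matrices receive gradients during adaptation.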

Authors:Yunsoo Kim, Michal W. S. Ong, Alex Shavick, Honghan Wu, Adam P. Levine
Title: HARE: an entity and relation centric evaluation framework for histopathology reports
Abstract:
Medical domain automated text generation is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially in instances where domain-specific metrics are lacking, e.g. histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity and relation centric framework, composed of a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a novel metric, which prioritizes clinically relevant content by aligning critical histopathology entities and relations between reference and generated reports. To develop the HARE benchmark, we annotated 813 de-identified clinical diagnostic histopathology reports and 652 histopathology reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. We fine-tuned GatorTronS, a domain-adapted language model, to develop HARE-NER and HARE-RE, which achieved the highest overall F1-score (0.915) among the tested models. The proposed HARE metric outperformed traditional metrics including ROUGE and Meteor, as well as radiology metrics such as RadGraph-XL, with the highest correlation and the best regression to expert evaluations (higher than the second-best method, GREEN, a large language model-based radiology report evaluator, by Pearson $r = 0.168$, Spearman $\rho = 0.161$, Kendall $\tau = 0.123$, $R^2 = 0.176$, $RMSE = 0.018$). We release HARE, datasets, and the models at https://github.com/knowlab/HARE to foster advancements in histopathology report generation, providing a robust framework for improving the quality of reports.
中文: HARE框架提出了一种新颖的基于实体和关系的组织病理学报告评估方法,通过关注临床相关内容,在专家评估相关性上显著优于现有指标。
English: The HARE framework introduces a novel entity and relation-based evaluation system for histopathology report generation, outperforming existing metrics by prioritizing clinically relevant content and demonstrating superior correlation with expert assessments.
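
At its core, the HARE metric rewards agreement of clinically relevant entities and relations between reference and generated reports. The sketch below is a deliberately simplified exact-match F1 over extracted (head, relation, tail) tuples; the actual metric relies on the learned HARE-NER/HARE-RE models and softer alignment:

```python
def overlap_f1(reference: set, generated: set) -> float:
    """F1 over extracted clinical entities/relations (exact-match simplification)."""
    if not reference and not generated:
        return 1.0
    tp = len(reference & generated)
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical tuples from an NER + RE pipeline over two reports.
ref = {("carcinoma", "located_in", "colon"), ("margin", "status", "negative")}
gen = {("carcinoma", "located_in", "colon"), ("margin", "status", "positive")}
print(round(overlap_f1(ref, gen), 3))   # 0.5: one of two relations matches
```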

Authors:Huaiyu Chen, Fahed Hassanat, Robert Laganiere, Martin Bouchard
Title: mRadNet: A Compact Radar Object Detector with MetaFormer
Abstract:
Frequency-modulated continuous wave radars have gained increasing popularity in the automotive industry. Their robustness against adverse weather conditions makes them a suitable choice for radar object detection in advanced driver assistance systems. These real-time embedded systems impose requirements on model compactness and efficiency that have been largely overlooked in previous work. In this work, we propose mRadNet, a novel radar object detection model with compactness in mind. mRadNet employs a U-net style architecture with MetaFormer blocks, in which separable convolution and attention token mixers are used to capture both local and global features effectively. More efficient token embedding and merging strategies are introduced to further facilitate the lightweight design. The performance of mRadNet is validated on the CRUW dataset, improving state-of-the-art performance with the fewest parameters and FLOPs.
Chinese: 提出的mRadNet模型采用U-net架构与MetaFormer模块,实现了紧凑高效的雷达目标检测系统,在CRUW数据集上以最少的参数和计算量取得了领先的性能。
English: The proposed mRadNet model introduces a compact and efficient radar object detection system using a U-net architecture with MetaFormer blocks, achieving state-of-the-art performance on the CRUW dataset with minimal parameters and computational cost.
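
A MetaFormer block is a Transformer block with the token mixer left pluggable: norm, mixer, residual, then norm, MLP, residual. A minimal sketch of the separable-convolution mixer variant named in the abstract; dimensions and norm choices are illustrative, and the paper additionally uses attention mixers:

```python
import torch
import torch.nn as nn

class SepConvMetaFormerBlock(nn.Module):
    """MetaFormer block: norm -> token mixer -> residual, norm -> MLP -> residual."""
    def __init__(self, dim=32):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        self.mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise: spatial mixing
            nn.Conv2d(dim, dim, 1),                         # pointwise: channel projection
        )
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))   # token mixing with a residual path
        return x + self.mlp(self.norm2(x))  # channel MLP with a residual path

block = SepConvMetaFormerBlock()
print(block(torch.randn(1, 32, 16, 16)).shape)
```

The depthwise-separable mixer keeps the parameter count near-linear in channel width, which is what makes the block attractive for embedded radar deployments.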

Authors:Xiaoqi Zhao, Youwei Pang, Chenyang Yu, Lihe Zhang, Huchuan Lu, Shijian Lu, Georges El Fakhri, Xiaofeng Liu
Title: UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation
Abstract:
Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. While existing methods address training-inference modality gaps via specialized per-combination models, they introduce high deployment costs by requiring exhaustive model subsets and model-modality matching. In this work, we propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels. First, we adopt modality reconstruction with the hybrid shuffled-masking augmentation, encouraging the model to learn the intrinsic modality characteristics and generate meaningful representations for missing modalities through cross-modal fusion. Next, modality-invariant contrastive learning implicitly compensates the feature space distance among incomplete-complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine-tuned under the hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms the state-of-the-art methods under diverse missing modality scenarios on MRI-based brain tumor segmentation, RGB-D semantic segmentation, RGB-D/T salient object segmentation. The code will be released at https://github.com/Xiaoqi-Zhao-DLUT/UniMRSeg.
Chinese: 本文提出UniMRSeg统一模态松弛分割网络,通过分层自监督补偿方法解决模态缺失导致的性能下降问题,在多种分割任务中均取得了最优性能。
English: This paper introduces UniMRSeg, a unified modality-relax segmentation network that addresses performance degradation from incomplete modalities through hierarchical self-supervised compensation, achieving state-of-the-art results across multiple segmentation tasks.

Authors:Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral
Title: AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models
Abstract:
Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading to generating images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.
中文: 当前文本到图像模型在生成以动作为核心的场景时因缺乏对细微属性的充分表征而表现不佳,但我们提出的无需训练方法利用大语言模型增强提示的时间细节,使生成准确率提升了72%。
English: Current Text-to-Image models struggle with accurately generating action-centric scenes due to insufficient representation of nuanced attributes, but our proposed training-free method using Large Language Models to enrich prompts with temporal details achieves a 72% improvement in generation accuracy.

Authors:Shen Cheng, Haipeng Li, Haibin Huang, Xiaohong Liu, Shuaicheng Liu
Title: Blind-Spot Guided Diffusion for Self-supervised Real-World Denoising
Abstract:
In this work, we present Blind-Spot Guided Diffusion, a novel self-supervised framework for real-world image denoising. Our approach addresses two major challenges: the limitations of blind-spot networks (BSNs), which often sacrifice local detail and introduce pixel discontinuities due to spatial independence assumptions, and the difficulty of adapting diffusion models to self-supervised denoising. We propose a dual-branch diffusion framework that combines a BSN-based diffusion branch, generating semi-clean images, with a conventional diffusion branch that captures underlying noise distributions. To enable effective training without paired data, we use the BSN-based branch to guide the sampling process, capturing noise structure while preserving local details. Extensive experiments on the SIDD and DND datasets demonstrate state-of-the-art performance, establishing our method as a highly effective self-supervised solution for real-world denoising. Code and pre-trained models are released at: https://github.com/Sumching/BSGD.
中文: 本文提出盲点引导扩散方法,通过双分支框架结合盲点网络和扩散模型解决自监督图像去噪难题,在标准数据集上取得了领先性能。
English: This paper introduces Blind-Spot Guided Diffusion, a self-supervised dual-branch framework that overcomes blind-spot network limitations and adapts diffusion models for real-world image denoising, achieving state-of-the-art results on benchmark datasets.

Authors:Bhavesh Sandbhor, Bheeshm Sharma, Balamurugan Palaniappan
Title: SLaM-DiMM: Shared Latent Modeling for Diffusion Based Missing Modality Synthesis in MRI
Abstract:
Brain MRI scans are often found in four modalities, consisting of T1-weighted with and without contrast enhancement (T1ce and T1w), T2-weighted imaging (T2w), and Flair. Leveraging complementary information from these different modalities enables models to learn richer, more discriminative features for understanding brain anatomy, which could be used in downstream tasks such as anomaly detection. However, in clinical practice, not all MRI modalities are always available due to various reasons. This makes missing modality generation a critical challenge in medical image analysis. In this paper, we propose SLaM-DiMM, a novel missing modality generation framework that harnesses the power of diffusion models to synthesize any of the four target MRI modalities from other available modalities. Our approach not only generates high-fidelity images but also ensures structural coherence across the depth of the volume through a dedicated coherence enhancement mechanism. Qualitative and quantitative evaluations on the BraTS-Lighthouse-2025 Challenge dataset demonstrate the effectiveness of the proposed approach in synthesizing anatomically plausible and structurally consistent results. Code is available at https://github.com/BheeshmSharma/SLaM-DiMM-MICCAI-BraTS-Challenge-2025.
中文: 本文提出SLaM-DiMM框架,利用扩散模型从现有模态合成缺失的MRI模态,通过一致性增强机制确保结构连贯性和高保真度,并在BraTS-Lighthouse-2025数据集上验证了其有效性。
English: The paper introduces SLaM-DiMM, a diffusion-based framework that synthesizes missing MRI modalities from available ones while ensuring structural coherence and high fidelity, validated on the BraTS-Lighthouse-2025 dataset.

Authors:Shiyu Fang, Yiming Cui, Haoyang Liang, Chen Lv, Peng Hang, Jian Sun
Title: CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine
Abstract:
Autonomous Driving (AD) systems have made notable progress, but their performance in long-tail, safety-critical scenarios remains limited. These rare cases contribute a disproportionate number of accidents. Vision-Language Action (VLA) models have strong reasoning abilities and offer a potential solution, but their effectiveness is limited by the lack of high-quality data and inefficient learning in such conditions. To address these challenges, we propose CoReVLA, a continual learning end-to-end autonomous driving framework that improves the performance in long-tail scenarios through a dual-stage process of data Collection and behavior Refinement. First, the model is jointly fine-tuned on a mixture of open-source driving QA datasets, allowing it to acquire a foundational understanding of driving scenarios. Next, CoReVLA is deployed within the Cave Automatic Virtual Environment (CAVE) simulation platform, where driver takeover data is collected from real-time interactions. Each takeover indicates a long-tail scenario that CoReVLA fails to handle reliably. Finally, the model is refined via Direct Preference Optimization (DPO), allowing it to learn directly from human preferences and thereby avoid reward hacking caused by manually designed rewards. Extensive open-loop and closed-loop experiments demonstrate that the proposed CoReVLA model can accurately perceive driving scenarios and make appropriate decisions. On the Bench2Drive benchmark, CoReVLA achieves a Driving Score (DS) of 72.18 and a Success Rate (SR) of 50%, outperforming state-of-the-art methods by 7.96 DS and 15% SR under long-tail, safety-critical scenarios. Furthermore, case studies demonstrate the model's ability to continually improve its performance in similar failure-prone scenarios by leveraging past takeover experiences. All code and preprocessed datasets are available at: https://github.com/FanGShiYuu/CoReVLA
中文: 提出的CoReVLA框架通过数据收集和行为优化的双阶段持续学习过程,有效提升了自动驾驶在长尾场景中的性能,利用基于仿真的驾驶员接管数据和直接偏好优化方法,在基准测试中取得了优于现有技术的表现。
English: The proposed CoReVLA framework enhances autonomous driving performance in long-tail scenarios through a dual-stage continual learning process of data collection and behavior refinement, achieving superior results on benchmarks by leveraging simulation-based driver takeovers and direct preference optimization.

Authors:Katharina Eckstein, Constantin Ulrich, Michael Baumgartner, Jessica Kächele, Dimitrios Bounias, Tassilo Wald, Ralf Floca, Klaus H. Maier-Hein
Title: The Missing Piece: A Case for Pre-Training in 3D Medical Object Detection
Abstract:
Large-scale pre-training holds the promise to advance 3D medical object detection, a crucial component of accurate computer-aided diagnosis. Yet, it remains underexplored compared to segmentation, where pre-training has already demonstrated significant benefits. Existing pre-training approaches for 3D object detection rely on 2D medical data or natural image pre-training, failing to fully leverage 3D volumetric information. In this work, we present the first systematic study of how existing pre-training methods can be integrated into state-of-the-art detection architectures, covering both CNNs and Transformers. Our results show that pre-training consistently improves detection performance across various tasks and datasets. Notably, reconstruction-based self-supervised pre-training outperforms supervised pre-training, while contrastive pre-training provides no clear benefit for 3D medical object detection. Our code is publicly available at: https://github.com/MIC-DKFZ/nnDetection-finetuning.
中文: 大规模预训练显著提升3D医学目标检测性能,其中基于重建的自监督方法效果最佳,而对比式预训练则收效甚微。
English: Large-scale pre-training significantly enhances 3D medical object detection, with reconstruction-based self-supervised methods proving most effective, while contrastive pre-training shows limited benefits.

Authors:David Calhas, Arlindo L. Oliveira
Title: Deep Feedback Models
Abstract:
Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom-up input with high-level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models.
中文: 深度反馈模型通过引入反馈机制实现内部状态的迭代优化,在噪声鲁棒性和小样本泛化能力上显著优于前馈网络,尤其在物体识别和医学影像任务中表现突出。
English: Deep Feedback Models (DFMs) enhance neural networks by integrating feedback mechanisms for iterative state refinement, demonstrating superior performance in noise resilience and data efficiency for tasks like object recognition and medical imaging.
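
The described dynamics amount to integrating dh/dt = -lam*h + f(x, h) with a recurrent network, where the exponential-decay term pulls the state toward a fixed point and stabilizes the feedback loop. A minimal explicit-Euler sketch; lam, dt, and the network f are illustrative:

```python
import torch
import torch.nn as nn

class FeedbackCell(nn.Module):
    """Iteratively refine a state h driven by bottom-up input x.

    Discretizes dh/dt = -lam * h + f(x, h): the decay term (-lam * h)
    pulls the state toward a fixed point, stabilizing the feedback loop.
    """
    def __init__(self, dim=16, lam=1.0, dt=0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.lam, self.dt = lam, dt

    def forward(self, x, steps=20):
        h = torch.zeros_like(x)
        for _ in range(steps):                   # feedback iterations over time
            dh = -self.lam * h + self.f(torch.cat([x, h], dim=-1))
            h = h + self.dt * dh                 # explicit Euler step
        return h

cell = FeedbackCell()
print(cell(torch.randn(4, 16)).shape)
```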

Authors:Jiahao Li, Xinhong Chen, Zhengmin Jiang, Qian Zhou, Yung-Hui Li, Jianping Wang
Title: Global Regulation and Excitation via Attention Tuning for Stereo Matching
Abstract:
Stereo matching has achieved significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and achieves second place on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.
中文摘要:GREAT框架通过整合三种注意力模块,为迭代式立体匹配方法引入全局上下文和几何信息,显著提升了在复杂区域的匹配性能,并在多个基准测试中取得领先成绩。
English Summary: The GREAT framework enhances iterative stereo-matching methods by integrating three attention modules to capture global context and geometric information, significantly improving performance in challenging regions and achieving top results on multiple benchmarks.

Authors:Liwei Liao, Xufeng Li, Xiaoyun Zheng, Boning Liu, Feng Gao, Ronggang Wang
Title: Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
Abstract:
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose \underline{G}rounding via \underline{V}iew \underline{R}etrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that transforms 3DVG into a 2D retrieval task, leveraging object-level view retrieval to collect grounding clues from multiple views. This not only avoids the costly process of 3D annotation, but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.
中文: 提出的GVR框架通过将3D物体定位转化为2D视图检索任务,为3D高斯溅射实现了零样本视觉定位,不仅无需逐场景训练和大量标注数据,还取得了最先进的性能表现。
English: The proposed GVR framework introduces a zero-shot visual grounding method for 3D Gaussian Splatting by transforming 3D object localization into a 2D view retrieval task, eliminating the need for per-scene training and extensive annotations while achieving state-of-the-art performance.

Authors:Haotian Zhang, Han Guo, Keyan Chen, Hao Chen, Zhengxia Zou, Zhenwei Shi
Title: FoBa: A Foreground-Background co-Guided Method and New Benchmark for Remote Sensing Semantic Change Detection
Abstract:
Despite the remarkable progress achieved in remote sensing semantic change detection (SCD), two major challenges remain. At the data level, existing SCD datasets suffer from limited change categories, insufficient change types, and a lack of fine-grained class definitions, making them inadequate to fully support practical applications. At the methodological level, most current approaches underutilize change information, typically treating it as a post-processing step to enhance spatial consistency, which constrains further improvements in model performance. To address these issues, we construct a new benchmark for remote sensing SCD, LevirSCD. Focused on the Beijing area, the dataset covers 16 change categories and 210 specific change types, with more fine-grained class definitions (e.g., roads are divided into unpaved and paved roads). Furthermore, we propose a foreground-background co-guided SCD (FoBa) method, which leverages foregrounds that focus on regions of interest and backgrounds enriched with contextual information to guide the model collaboratively, thereby alleviating semantic ambiguity while enhancing its ability to detect subtle changes. Considering the requirements of bi-temporal interaction and spatial consistency in SCD, we introduce a Gated Interaction Fusion (GIF) module along with a simple consistency loss to further enhance the model's detection performance. Extensive experiments on three datasets (SECOND, JL1, and the proposed LevirSCD) demonstrate that FoBa achieves competitive results compared to current SOTA methods, with improvements of 1.48%, 3.61%, and 2.81% in the SeK metric, respectively. Our code and dataset are available at https://github.com/zmoka-zht/FoBa.
Chinese: 该研究提出了LevirSCD遥感语义变化检测数据集,包含细粒度变化类别,并开发了FoBa方法,通过前景-背景协同引导和门控交互融合模块,在三个数据集上相比现有方法取得了显著性能提升。
English: The study introduces LevirSCD, a new remote sensing semantic change detection dataset with fine-grained categories, and proposes the FoBa method that uses foreground-background co-guidance and a Gated Interaction Fusion module to significantly improve detection performance over existing approaches.
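
The abstract does not detail the Gated Interaction Fusion module, but gated fusion of bi-temporal features commonly takes the form below: a sigmoid gate computed from both feature maps convexly mixes the two streams. A generic sketch under that assumption, not the paper's exact GIF design:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of bi-temporal features (a generic stand-in for GIF)."""
    def __init__(self, dim=32):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())

    def forward(self, feat_t1, feat_t2):
        g = self.gate(torch.cat([feat_t1, feat_t2], dim=1))  # per-pixel mixing weights
        return g * feat_t1 + (1 - g) * feat_t2               # convex combination

fuse = GatedFusion()
out = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
print(out.shape)
```

The gate lets the network decide, pixel by pixel, which temporal view carries the more reliable evidence, rather than averaging the two uniformly.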

Authors:Chang Soo Lim, Joonyoung Moon, Donghyeon Cho
Title: Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution
Abstract:
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.
Chinese: 提出的SCOPE框架通过将SAM2的ViT编码器和运动预测模块整合到Cutie中,增强了视频对象分割能力,并采用集成策略在LSVOS挑战赛中荣获第三名。
English: The proposed SCOPE framework enhances video object segmentation by integrating SAM2's enriched ViT encoder and a motion prediction module into Cutie, achieving third place in the LSVOS Challenge through an ensemble strategy.

Authors:Yang Li, Tingfa Xu, Shuyan Bai, Peifu Liu, Jianan Li
Title: MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.
中文: 针对基于RGB的伪装目标检测的局限性,MCOD基准数据集应运而生,它采用多光谱图像,包含多样化挑战和高质量标注,通过整合光谱信息显著提升了检测的鲁棒性。
English: To address the limitations of RGB-based camouflaged object detection, the MCOD benchmark dataset is introduced, featuring multispectral imagery with diverse challenges and high-quality annotations, which significantly improves detection robustness through spectral information integration.

Authors:Fangyuan Mao, Shuo Wang, Jilin Mei, Chen Min, Shun Lu, Fuyang Liu, Yu Hu
Title: UNIV: Unified Foundation Model for Infrared and Visible Modalities
Abstract:
The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing - combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at https://github.com/fangyuanmao/UNIV.
中文: UNIV基础模型通过仿生创新解决了RGB可见光与红外感知的多模态性能差距:采用注意力引导的跨模态对比学习实现特征对齐,结合双知识保留机制防止灾难性遗忘,在提升红外任务性能的同时保持了RGB任务的基准表现。
English: The UNIV foundation model addresses multimodal performance gaps in RGB-visible and infrared perception through biologically inspired innovations—Patch-wise Cross-modality Contrastive Learning for feature alignment and a dual-knowledge preservation mechanism to prevent catastrophic forgetting—demonstrating superior infrared task performance while maintaining RGB capability.
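
Patch-wise cross-modality contrastive learning presumably treats co-located visible/infrared patch embeddings as positive pairs. A minimal symmetric InfoNCE sketch over aligned patches, omitting the paper's attention-guided weighting; the temperature is an illustrative default:

```python
import torch
import torch.nn.functional as F

def patchwise_infonce(vis_feats, ir_feats, temperature=0.07):
    """Symmetric InfoNCE between aligned visible/infrared patch embeddings.

    vis_feats, ir_feats: [N, D] where row i of each tensor comes from the same
    spatial patch of a registered image pair (the positive pair).
    """
    vis = F.normalize(vis_feats, dim=-1)
    ir = F.normalize(ir_feats, dim=-1)
    logits = vis @ ir.T / temperature            # [N, N] patch similarities
    targets = torch.arange(len(vis))             # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

print(patchwise_infonce(torch.randn(128, 256), torch.randn(128, 256)))
```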

Authors:Zheng Wang, Hong Liu, Zheng Wang, Danyi Li, Min Cen, Baptiste Magnier, Li Liang, Liansheng Wang
Title: Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation
Abstract:
Survival analysis based on Whole Slide Images (WSIs) is crucial for evaluating cancer prognosis, as they offer detailed microscopic information essential for predicting patient outcomes. However, traditional WSI-based survival analysis usually faces noisy features and limited data accessibility, hindering their ability to capture critical prognostic features effectively. Although pathology reports provide rich patient-specific information that could assist analysis, their potential to enhance WSI-based survival analysis remains largely unexplored. To this end, this paper proposes a novel Report-auxiliary self-distillation (Rasa) framework for WSI-based survival analysis. First, advanced large language models (LLMs) are utilized to extract fine-grained, WSI-relevant textual descriptions from original noisy pathology reports via a carefully designed task prompt. Next, a self-distillation-based pipeline is designed to filter out irrelevant or redundant WSI features for the student model under the guidance of the teacher model's textual knowledge. Finally, a risk-aware mix-up strategy is incorporated during the training of the student model to enhance both the quantity and diversity of the training data. Extensive experiments carried out on our collected data (CRC) and public data (TCGA-BRCA) demonstrate the superior effectiveness of Rasa against state-of-the-art methods. Our code is available at https://github.com/zhengwang9/Rasa.
中文: 本文提出Rasa框架,通过大型语言模型从病理报告中提取相关文本特征,并采用自蒸馏和风险感知混合策略优化全切片图像特征筛选与数据多样性,在CRC和TCGA-BRCA数据集上实现了卓越的癌症生存分析性能。
English: This paper introduces the Rasa framework, which enhances WSI-based cancer survival analysis by leveraging LLMs to extract relevant textual features from pathology reports and employing self-distillation with a risk-aware mix-up strategy to improve feature selection and data diversity, achieving superior performance on CRC and TCGA-BRCA datasets.

Authors:Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin
Title: Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
Abstract:
Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
中文: 潜在分区网络(LZN)通过构建共享高斯潜空间和任务专用编解码器,统一了生成建模、表征学习和分类三大机器学习核心任务,在保持训练目标不变的前提下实现了多项性能提升。
English: The Latent Zoning Network (LZN) unifies generative modeling, representation learning, and classification by creating a shared Gaussian latent space with task-specific encoders and decoders, demonstrating improved performance across diverse machine learning tasks without modifying core training objectives.
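
The compositional recipe is easy to make concrete: each modality gets an encoder into, and a decoder out of, one shared latent space, and every task is a composition of those maps. A schematic sketch with toy linear modules; the latent-zone construction itself is omitted:

```python
import torch
import torch.nn as nn

# Toy encoders/decoders over one shared latent space (dim 8); zone structure omitted.
latent = 8
image_enc, label_enc = nn.Linear(64, latent), nn.Embedding(10, latent)
image_dec, label_dec = nn.Linear(latent, 64), nn.Linear(latent, 10)

x, y = torch.randn(4, 64), torch.tensor([3, 1, 4, 1])

embedding = image_enc(x)                 # representation learning
logits = label_dec(image_enc(x))         # classification
generated = image_dec(label_enc(y))      # label-conditional generation
print(embedding.shape, logits.shape, generated.shape)
```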

Authors:Shilong Bao, Qianqian Xu, Feiran Li, Boyu Han, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Title: Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach
Abstract:
This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
中文: 本文针对显著目标检测中评估指标对尺寸的敏感性问题,提出了一个尺寸不变性评估框架,通过独立评估各组件来消除尺寸偏差,确保不同大小目标的公平检测。
English: This paper identifies and addresses the size bias in Salient Object Detection metrics by proposing a Size-Invariant Evaluation framework that ensures balanced assessment across objects of varying sizes.
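
The size-invariant idea can be illustrated directly: rather than one global score that large regions dominate, score each ground-truth object separately and average with equal weight. A simplified sketch using per-object IoU inside each object's bounding box (SIEva's actual decomposition is defined per metric; scipy is assumed available):

```python
import numpy as np
from scipy.ndimage import label, find_objects

def size_invariant_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average a per-object IoU over ground-truth components (SIEva, simplified).

    Each GT object is scored inside its own bounding box, so errors on a large
    object no longer swamp the score of a small one.
    """
    components, n = label(gt)
    if n == 0:
        return float(pred.sum() == 0)
    scores = []
    for sl in find_objects(components):          # bounding box of each object
        p, g = pred[sl].astype(bool), gt[sl].astype(bool)
        union = np.logical_or(p, g).sum()
        scores.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return float(np.mean(scores))

gt = np.zeros((64, 64), bool); gt[:40, :40] = True; gt[50:52, 50:52] = True
pred = np.zeros_like(gt); pred[:40, :40] = True  # misses the small object entirely
print(size_invariant_score(pred, gt))            # 0.5, while global IoU is ~0.997
```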

Authors:Tian Lan, Yiming Zheng, Jianxin Yin
Title: Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification
Abstract:
Multi-label classification has broad applications and depends on powerful representations capable of capturing multi-label interactions. We introduce \textit{Diff-Feat}, a simple but powerful framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text, and fuses them for downstream tasks. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block of the Transformer. In contrast, for language tasks, the best feature occurs at the noise-free step and is located in the deepest block. In particular, we observe a striking phenomenon across varying datasets: a mysterious "Layer $12$" consistently yields the best performance on various downstream classification tasks for images (under DiT-XL/2-256$\times$256). We devise a heuristic local-search algorithm that pinpoints the locally optimal "image-text"$\times$"block-timestep" pair among a few candidates, avoiding an exhaustive grid search. A simple fusion (a linear projection followed by addition) of the selected representations yields state-of-the-art performance: 98.6\% mAP on MS-COCO-enhanced and 45.7\% mAP on Visual Genome 500, surpassing strong CNN, graph, and Transformer baselines by a wide margin. t-SNE and clustering metrics further reveal that \textit{Diff-Feat} forms tighter semantic clusters than unimodal counterparts. The code is available at https://github.com/lt-0123/Diff-Feat.
Chinese: Diff-Feat框架通过提取预训练扩散Transformer模型的图像和文本中间特征并进行融合,采用启发式搜索寻找最优特征对,在多标签分类任务中实现了最先进的性能。
English: The Diff-Feat framework extracts and fuses intermediate features from pre-trained diffusion-Transformer models for images and text, achieving state-of-the-art multi-label classification performance through a heuristic search for optimal feature pairs.

Authors:Wei Chen, Tongguan Wang, Feiyue Xue, Junkai Li, Hui Liu, Ying Sha
Title: Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues
Abstract:
Desire, as an intention that drives human behavior, is closely related to both emotion and sentiment. Multimodal learning has advanced sentiment and emotion recognition, but multimodal approaches specifically targeting human desire understanding remain underexplored. Moreover, existing methods in sentiment analysis predominantly emphasize verbal cues and overlook images as complementary non-verbal cues. To address these gaps, we propose a Symmetrical Bidirectional Multimodal Learning Framework for Desire, Emotion, and Sentiment Recognition, which enforces mutual guidance between text and image modalities to effectively capture intention-related representations in the image. Specifically, low-resolution images are used to obtain global visual representations for cross-modal alignment, while high-resolution images are partitioned into sub-images and modeled with masked image modeling to enhance the ability to capture fine-grained local features. A text-guided image decoder and an image-guided text decoder are introduced to facilitate deep cross-modal interaction at both local and global representations of image information. Additionally, to balance perceptual gains with computation cost, a mixed-scale image strategy is adopted, where high-resolution images are cropped into sub-images for masked modeling. The proposed approach is evaluated on MSED, a multimodal dataset that includes a desire understanding benchmark, as well as emotion and sentiment recognition. Experimental results indicate consistent improvements over other state-of-the-art methods, validating the effectiveness of our proposed method. Specifically, our method outperforms existing approaches, achieving F1-score improvements of 1.1% in desire understanding, 0.6% in emotion recognition, and 0.9% in sentiment analysis. Our code is available at: https://github.com/especiallyW/SyDES.
中文: 本文提出了一种对称双向多模态学习框架,通过文本与图像模态的相互引导来增强欲望、情感和情绪识别能力,在三个任务上均实现了性能提升并超越了现有最佳方法。
English: This paper introduces a Symmetrical Bidirectional Multimodal Learning Framework that enhances desire, emotion, and sentiment recognition by enabling mutual guidance between text and image modalities, achieving state-of-the-art performance improvements across all three tasks.

Authors:Abdarahmane Traore, Éric Hervet, Andy Couturier
Title: SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters
Abstract:
Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively aligns visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code of the experiments will be available at: https://github.com/abtraore/SmolRGPT
中文: SmolRGPT是一种紧凑的视觉语言模型,通过融合RGB和深度信息实现高效空间推理,仅用6亿参数即可获得优异性能,适用于资源受限的实际应用场景。
English: SmolRGPT is a compact vision-language model that integrates RGB and depth cues for efficient spatial reasoning, achieving competitive performance with only 600M parameters while enabling deployment in resource-constrained environments.

Authors:Yulin Wang, Yang Yue, Yang Yue, Huanqian Wang, Haojun Jiang, Yizeng Han, Zanlin Ni, Yifan Pu, Minglei Shi, Rui Lu, Qisen Yang, Andrew Zhao, Zhuofan Xia, Shiji Song, Gao Huang
Title: Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception
Abstract:
Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial-temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from 'passive' to 'active, adaptive' vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations, and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied AI, and side-by-side comparisons with humans. AdaptiveNN achieves up to 28x inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue toward efficient, flexible, and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviors in many cases, revealing its potential as a valuable tool for investigating visual cognition. Code is available at https://github.com/LeapLabTHU/AdaptiveNN.
中文摘要:AdaptiveNN提出了一种主动视觉框架,通过模拟人眼注视机制实现从粗到精的序列化视觉处理,在保持精度的同时大幅降低计算成本,并在多任务中展现出类人的感知特性与良好可解释性。
English Summary: AdaptiveNN introduces an active vision framework that mimics human eye movements to process visual information sequentially, significantly reducing computational costs while maintaining accuracy and enhancing interpretability across diverse tasks.
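
The coarse-to-fine sequential process can be sketched as a glimpse loop: crop a patch at the current fixation, update a recurrent state, propose the next fixation, and stop once the prediction is confident. A toy sketch with illustrative modules and a confidence-threshold stopping rule; the paper's self-rewarding reinforcement-learning training is not shown:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpseLoop(nn.Module):
    """Sequential fixation: crop a patch, update state, stop when confident."""
    def __init__(self, patch=8, dim=32, n_classes=10):
        super().__init__()
        self.p, self.dim = patch, dim
        self.encode = nn.Linear(patch * patch, dim)
        self.rnn = nn.GRUCell(dim, dim)
        self.where = nn.Linear(dim, 2)            # next fixation location in [-1, 1]
        self.classify = nn.Linear(dim, n_classes)

    def forward(self, img, max_fix=6, conf=0.9):
        B, H, W = img.shape
        h = torch.zeros(B, self.dim)
        loc = torch.zeros(B, 2)                   # start fixating the image center
        for t in range(max_fix):
            cy = ((loc[:, 0] + 1) / 2 * (H - self.p)).long()
            cx = ((loc[:, 1] + 1) / 2 * (W - self.p)).long()
            patches = torch.stack([img[b, cy[b]:cy[b] + self.p, cx[b]:cx[b] + self.p]
                                   for b in range(B)])
            h = self.rnn(self.encode(patches.flatten(1)), h)
            probs = F.softmax(self.classify(h), dim=-1)
            if probs.max(dim=-1).values.min() > conf:  # every sample is confident
                break                                   # actively stop observing
            loc = torch.tanh(self.where(h))             # decide where to look next
        return probs, t + 1

model = GlimpseLoop()
probs, n_fix = model(torch.randn(2, 32, 32))
print(probs.shape, n_fix)
```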

Authors:Dinura Dissanayake, Ahmed Heakl, Omkar Thawakar, Noor Ahsan, Ritesh Thawkar, Ketan More, Jean Lahoud, Rao Anwer, Hisham Cholakkal, Ivan Laptev, Fahad Shahbaz Khan, Salman Khan
Title: How Good are Foundation Models in Step-by-Step Embodied Reasoning?
Abstract:
Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.
Summary: This study introduces the Foundation Model Embodied Reasoning (FoMER) benchmark to assess large multimodal models' step-by-step reasoning in embodied environments, revealing their potential and limitations in tasks that require multimodal interpretation and safe action generation.

Authors:Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke
Title: Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning
Abstract:
Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers appealing tradeoffs for efficiency with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost due to longer walks. Thus, the inability to identify uninformative tokens undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction. We also extract navigation-relevant instructions using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further help avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% FLOPS.
Summary: Navigation-Aware Pruning (NAP) improves vision-and-language navigation efficiency by pre-filtering tokens with navigation-specific criteria and pruning mainly background tokens, saving more than 50% of FLOPs while preserving high success rates.
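
The pre-filtering idea can be sketched in a few lines. The following is a hedged illustration, not the paper's code; the navigable mask, importance scores, and the 30% keep ratio are assumed inputs.

import torch

def navigation_aware_prune(tokens, importance, navigable, keep_background=0.3):
    """tokens: (N, D); importance: (N,); navigable: (N,) bool foreground mask."""
    fg = tokens[navigable]                         # foreground is always kept
    bg_tokens = tokens[~navigable]
    if len(bg_tokens) == 0:
        return fg
    k = max(1, int(keep_background * len(bg_tokens)))
    keep = importance[~navigable].topk(k).indices  # prune low-importance background
    return torch.cat([fg, bg_tokens[keep]], dim=0)

tokens = torch.randn(36, 768)        # e.g., tokens for 36 candidate view images
importance = torch.rand(36)
navigable = torch.rand(36) > 0.7     # views the agent can actually move toward
pruned = navigation_aware_prune(tokens, importance, navigable)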

Authors:Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitain Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen
Title: MICA: Multi-Agent Industrial Coordination Assistant
Abstract:
Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.
Summary: MICA is a speech-interactive multi-agent system for industrial assistance that combines adaptive step reasoning with a safety checker, improving task success and reliability while operating under privacy and hardware constraints.

Authors:Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Title: ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Abstract:
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
Summary: Vision-Aware Speculative Decoding (ViSpec) accelerates vision-language models by compressing image tokens through a lightweight vision adaptor and augmenting text tokens with a global image feature, achieving the first substantial speedup in VLM speculative decoding.
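
A compact way to realize such a vision adaptor is cross-attention from a small set of learnable queries to the image tokens. The sketch below is an assumption-laden illustration (dimensions, single layer, query count), not ViSpec's released architecture.

import torch
import torch.nn as nn

class VisionAdaptor(nn.Module):
    def __init__(self, dim=1024, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_tokens):            # (B, N_img, dim), N_img large
        B = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        compressed, _ = self.attn(q, image_tokens, image_tokens)
        return compressed                       # (B, n_queries, dim)

adaptor = VisionAdaptor()
compact = adaptor(torch.randn(2, 576, 1024))   # 576 image tokens -> 16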

Authors:Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
Title: Calibration-Aware Prompt Learning for Medical Vision-Language Models
Abstract:
Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration remains largely unexplored and poses a significant challenge: miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under a scarce-labeled-data regime. First, we study a regularizer that aligns the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss that maximizes textual feature proximity to improve the reliability of confidence estimates in multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.
Summary: CalibPrompt calibrates the confidence of medical vision-language models during prompt tuning via an accuracy-alignment regularizer and an angular separation loss, improving reliability without significantly compromising accuracy.
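
A regularizer of this flavor can be written directly against the prompt-tuning loss. The sketch below is one plausible reading of an accuracy-alignment term, with the smoothing factor and weighting chosen arbitrarily; it is not the paper's exact objective.

import torch
import torch.nn.functional as F

def calibration_regularizer(logits, labels, smooth=0.9):
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                 # per-sample confidence
    acc = (pred == labels).float().mean()          # batch accuracy
    smoothed_acc = smooth * acc + (1 - smooth) * conf.mean()
    return (conf - smoothed_acc).abs().mean()      # pull confidence toward accuracy

logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = F.cross_entropy(logits, labels) + 0.1 * calibration_regularizer(logits, labels)
loss.backward()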

Authors:Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
Title: Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation
Abstract:
We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.
Summary: VocAlign is a source-free domain adaptation framework for open-vocabulary semantic segmentation that combines a student-teacher paradigm with vocabulary alignment, LoRA fine-tuning, and Top-K class selection, achieving a 6.11 mIoU improvement on CityScapes while reducing compute and memory demands.

Authors:Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, Stefano Mattoccia
Title: Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Abstract:
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs to infer depth from monocular event cameras, either using a vanilla one such as Depth Anything v2 (DAv2) or deriving from it a novel recurrent architecture. We evaluate our approach with synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
Summary: This paper introduces a cross-modal distillation approach that uses Vision Foundation Models to generate dense proxy depth labels from event data, enabling competitive monocular depth estimation without costly annotations and achieving state-of-the-art results.

Authors:Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, Yuan Huang, Yiqian Liu, Weijie Su, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
Title: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Abstract:
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
Summary: ScaleCUA introduces a large-scale cross-platform dataset and foundation models for computer use agents, enabling operation across six operating systems and setting state-of-the-art results on multiple GUI benchmarks.

Authors:Fangjinhua Wang, Qingshan Xu, Yew-Soon Ong, Marc Pollefeys
Title: Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model
Abstract:
To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve computational efficiency, many methods initialize a coarse depth map and then gradually refine it at higher resolutions. Recently, diffusion models have achieved great success in generation tasks. Starting from random noise, diffusion models gradually recover the sample through an iterative denoising process. In this paper, we propose a novel MVS framework that introduces diffusion models into MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining a lightweight 2D U-Net and a convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by the diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.
Summary: This paper presents a multi-view stereo framework that formulates depth refinement as a conditional diffusion process; the resulting DiffMVS and CasDiffMVS methods deliver state-of-the-art efficiency and reconstruction accuracy.
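
Confidence-based hypothesis sampling can be illustrated simply: spread the depth hypotheses wider where the model is uncertain. The linear spread, radius, and hypothesis count below are assumptions, not the paper's exact scheme.

import torch

def sample_depth_hypotheses(depth, confidence, n_hyp=4, max_radius=0.1):
    """depth, confidence: (B, 1, H, W); returns (B, n_hyp, H, W) hypotheses."""
    radius = max_radius * (1.0 - confidence)          # wide search when uncertain
    offsets = torch.linspace(-1.0, 1.0, n_hyp).view(1, n_hyp, 1, 1)
    return depth + offsets * radius                   # per-pixel adaptive samples

depth = torch.rand(2, 1, 64, 80)
conf = torch.rand(2, 1, 64, 80)
hyps = sample_depth_hypotheses(depth, conf)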

Authors:Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
Title: RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Abstract:
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
Summary: RynnVLA-001 is a vision-language-action model built on two-stage pretraining that couples ego-centric video generation with trajectory-aware modeling, plus an ActionVAE for compact action representation, achieving superior performance on downstream robot manipulation tasks.
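
The ActionVAE idea, compressing an action chunk into a compact latent that the VLA head predicts, follows the standard VAE recipe. The sketch below uses assumed sizes (16-step horizon, 7-DoF actions, 32-dim latent) and linear layers rather than the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVAE(nn.Module):
    def __init__(self, horizon=16, act_dim=7, latent=32):
        super().__init__()
        self.enc = nn.Linear(horizon * act_dim, 2 * latent)  # -> (mu, logvar)
        self.dec = nn.Linear(latent, horizon * act_dim)
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, actions):                    # (B, horizon, act_dim)
        mu, logvar = self.enc(actions.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(z).view(-1, self.horizon, self.act_dim)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, kl

vae = ActionVAE()
actions = torch.randn(4, 16, 7)                    # a chunk of 16 7-DoF actions
recon, kl = vae(actions)
loss = F.mse_loss(recon, actions) + 1e-3 * kl      # standard VAE objective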

Authors:Pierre Fernandez, Tomáš Souček, Nikola Jovanović, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu, Tuan Tran, Alexandre Mourachko
Title: Geometric Image Synchronization with Deep Watermarking
Abstract:
Synchronization is the task of estimating and inverting geometric transformations (e.g., crop, rotation) applied to an image. This work introduces SyncSeal, a bespoke watermarking method for robust image synchronization, which can be applied on top of existing watermarking methods to enhance their robustness against geometric transformations. It relies on an embedder network that imperceptibly alters images and an extractor network that predicts the geometric transformation to which the image was subjected. Both networks are end-to-end trained to minimize the error between the predicted and ground-truth parameters of the transformation, combined with a discriminator to maintain high perceptual quality. We experimentally validate our method on a wide variety of geometric and valuemetric transformations, demonstrating its effectiveness in accurately synchronizing images. We further show that our synchronization can effectively upgrade existing watermarking methods to withstand geometric transformations to which they were previously vulnerable.
Summary: SyncSeal is a watermarking method for image synchronization that trains an embedder and an extractor network to predict and invert geometric transformations, upgrading existing watermarking schemes to withstand transformations they were previously vulnerable to.
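
The embedder-extractor training loop reduces to parameter regression. This sketch restricts the transformation space to rotation only and uses toy networks; both simplifications are assumptions for illustration.

import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

embedder = nn.Conv2d(3, 3, 3, padding=1)
extractor = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 64, 1))

def train_step(images, opt):
    angle = torch.empty(1).uniform_(-30, 30).item()         # ground-truth transform
    marked = images + 0.01 * torch.tanh(embedder(images))   # imperceptible residual
    warped = TF.rotate(marked, angle)
    pred = extractor(warped).mean()
    loss = (pred - angle) ** 2                              # parameter regression
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

opt = torch.optim.Adam([*embedder.parameters(), *extractor.parameters()], lr=1e-4)
train_step(torch.rand(4, 3, 128, 128), opt)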

Authors:Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau
Title: Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Abstract:
Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as grounding tokens, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to the inability to fully integrate the cues in the text query (e.g., attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries for inquiring about the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatial and temporal related visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy to assemble the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
Summary: This paper proposes a zero-shot framework for spatio-temporal video grounding that equips multimodal large language models with decomposed spatio-temporal highlighting and temporal-augmented assembling strategies, achieving state-of-the-art performance on three common benchmarks.

Authors:Ali Nazari, Bardiya Kariminia, Mohsen Ebrahimi Moghaddam
Title: A Race Bias Free Face Aging Model for Reliable Kinship Verification
Abstract:
The age gap in kinship verification addresses the time difference between the photos of the parent and the child. Moreover, their same-age photos are often unavailable, and face aging models are racially biased, which impacts the likeness of photos. Therefore, we propose a face aging GAN model, RA-GAN, consisting of two new modules, RACEpSp and a feature mixer, to produce racially unbiased images. The unbiased synthesized photos are used in kinship verification to investigate the results of verifying same-age parent-child images. The experiments demonstrate that our RA-GAN outperforms SAM-GAN by an average of 13.14% across all age groups, and CUSP-GAN in the 60+ age group by 9.1%, in terms of racial accuracy. Moreover, RA-GAN can preserve subjects' identities better than SAM-GAN and CUSP-GAN across all age groups. Additionally, we demonstrate that transforming parent and child images from the KinFaceW-I and KinFaceW-II datasets to the same age can enhance verification accuracy across all age groups. With our RA-GAN, accuracy on KinFaceW-I increases by 5.22, 5.12, 1.63, and 0.41 for the father-son, father-daughter, mother-son, and mother-daughter relationships, respectively, and on KinFaceW-II by 2.9, 0.39, and 1.6 for the father-daughter, father-son, and mother-son relationships, respectively. The code is available at https://github.com/bardiya2254kariminia/An-Age-Transformation-whitout-racial-bias-for-Kinship-verification
Summary: RA-GAN is a racially unbiased face aging model that improves kinship verification by generating same-age parent-child images, achieving better racial accuracy and identity preservation than existing aging models.

Authors:Pak-Hei Yeung, Jayroop Ramesh, Pengfei Lyu, Ana Namburete, Jagath Rajapakse
Title: Semi-Supervised 3D Medical Segmentation from 2D Natural Images Pretrained Model
Abstract:
This paper explores the transfer of knowledge from general vision models pretrained on 2D natural images to improve 3D medical image segmentation. We focus on the semi-supervised setting, where only a few labeled 3D medical images are available, along with a large set of unlabeled images. To tackle this, we propose a model-agnostic framework that progressively distills knowledge from a 2D pretrained model to a 3D segmentation model trained from scratch. Our approach, M&N, involves iterative co-training of the two models using pseudo-masks generated by each other, along with our proposed learning rate guided sampling that adaptively adjusts the proportion of labeled and unlabeled data in each training batch to align with the models' prediction accuracy and stability, minimizing the adverse effect caused by inaccurate pseudo-masks. Extensive experiments on multiple publicly available datasets demonstrate that M&N achieves state-of-the-art performance, outperforming thirteen existing semi-supervised segmentation approaches under all different settings. Importantly, ablation studies show that M&N remains model-agnostic, allowing seamless integration with different architectures. This ensures its adaptability as more advanced models emerge. The code is available at https://github.com/pakheiyeung/M-N.
Summary: M&N is a model-agnostic framework that transfers knowledge from 2D pretrained vision models to 3D medical image segmentation through iterative co-training and learning-rate-guided sampling, achieving state-of-the-art semi-supervised results.
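
Learning-rate-guided sampling can be sketched as a batch-composition rule. The mapping from learning rate to labeled fraction below is an assumed schedule, not the paper's exact formula.

import random

def build_batch(labeled, unlabeled, lr, lr_max, batch_size=4, min_labeled=0.25):
    # While the learning rate is high and pseudo-masks are unstable, favor
    # labeled volumes; as the rate decays, mix in more pseudo-labeled data.
    frac_labeled = max(min_labeled, lr / lr_max)
    n_lab = max(1, round(frac_labeled * batch_size))
    return random.sample(labeled, n_lab) + random.sample(unlabeled, batch_size - n_lab)

labeled_ids = [f"lab_{i}" for i in range(10)]
unlabeled_ids = [f"unlab_{i}" for i in range(100)]
batch = build_batch(labeled_ids, unlabeled_ids, lr=3e-4, lr_max=1e-3)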

Authors:Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding
Title: MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Abstract:
Ensuring factual consistency and reliable reasoning remains a critical challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that integrates external knowledge grounding with reinforcement learning to improve factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise, while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of the pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Code is released at https://github.com/Garfieldgengliang/MEDFACT-R1.
Summary: MEDFACT-R1 is a two-stage framework that combines pseudo-label supervised fine-tuning with GRPO-based reinforcement learning to improve factual medical reasoning, delivering up to 22.5% absolute improvement in factual accuracy over prior methods.

Authors:Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, David Doermann
Title: AutoEdit: Automatic Hyperparameter Tuning for Image Editing
Abstract:
Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To obtain reasonable editing performance, these methods often require the user to brute-force-tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification. This process incurs high computational costs due to the huge hyperparameter search space. We cast the search for optimal editing hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework that establishes a Markov Decision Process to dynamically adjust hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate a significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of diffusion-based image editing in the real world. Code can be found at https://github.com/chaupham1709/AutoEdit.git.
Summary: AutoEdit casts hyperparameter search for diffusion-based image editing as a sequential decision process optimized with reinforcement learning, sharply reducing search time and computational overhead compared to brute-force tuning.

Authors:Shenghao Zhu, Yifei Chen, Weihong Chen, Shuo Jiang, Guanyu Zhou, Yuanhan Wang, Feiwei Qin, Changmiao Wang, Qiyuan Tian
Title: No Modality Left Behind: Adapting to Missing Modalities via Knowledge Distillation for Brain Tumor Segmentation
Abstract:
Accurate brain tumor segmentation is essential for preoperative evaluation and personalized treatment. Multi-modal MRI is widely used due to its ability to capture complementary tumor features across different sequences. However, in clinical practice, missing modalities are common, limiting the robustness and generalizability of existing deep learning methods that rely on complete inputs, especially under non-dominant modality combinations. To address this, we propose AdaMM, a multi-modal brain tumor segmentation framework tailored for missing-modality scenarios, centered on knowledge distillation and composed of three synergistic modules. The Graph-guided Adaptive Refinement Module explicitly models semantic associations between generalizable and modality-specific features, enhancing adaptability to modality absence. The Bi-Bottleneck Distillation Module transfers structural and textural knowledge from teacher to student models via global style matching and adversarial feature alignment. The Lesion-Presence-Guided Reliability Module predicts prior probabilities of lesion types through an auxiliary classification task, effectively suppressing false positives under incomplete inputs. Extensive experiments on the BraTS 2018 and 2024 datasets demonstrate that AdaMM consistently outperforms existing methods, exhibiting superior segmentation accuracy and robustness, particularly in single-modality and weak-modality configurations. In addition, we conduct a systematic evaluation of six categories of missing-modality strategies, confirming the superiority of knowledge distillation and offering practical guidance for method selection and future research. Our source code is available at https://github.com/Quanato607/AdaMM.
Summary: AdaMM tackles missing-modality brain tumor segmentation with knowledge distillation and three synergistic modules, showing superior accuracy and robustness on BraTS 2018 and 2024, particularly in single- and weak-modality configurations.

Authors:Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Yanyuan Qiao, Imran Razzak, Yutong Xie
Title: A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making
Abstract:
Medical decision-making often involves integrating knowledge from multiple clinical specialties, typically achieved through multidisciplinary teams. Inspired by this collaborative process, recent work has leveraged large language models (LLMs) in multi-agent collaboration frameworks to emulate expert teamwork. While these approaches improve reasoning through agent interaction, they are limited by static, pre-assigned roles, which hinder adaptability and dynamic knowledge integration. To address these limitations, we propose KAMAC, a Knowledge-driven Adaptive Multi-Agent Collaboration framework that enables LLM agents to dynamically form and expand expert teams based on the evolving diagnostic context. KAMAC begins with one or more expert agents and then conducts a knowledge-driven discussion to identify and fill knowledge gaps by recruiting additional specialists as needed. This supports flexible, scalable collaboration in complex clinical scenarios, with decisions finalized through reviewing updated agent comments. Experiments on two real-world medical benchmarks demonstrate that KAMAC significantly outperforms both single-agent and advanced multi-agent methods, particularly in complex clinical scenarios (i.e., cancer prognosis) requiring dynamic, cross-specialty expertise. Our code is publicly available at: https://github.com/XiaoXiao-Woo/KAMAC.
Summary: KAMAC is a knowledge-driven multi-agent collaboration framework in which LLM agents dynamically recruit additional specialists as the diagnostic context evolves, significantly outperforming single-agent and static multi-agent methods in complex scenarios such as cancer prognosis.

Authors:Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
Title: EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Abstract:
Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
Summary: EchoVLM is a Mixture-of-Experts vision-language model for ultrasound imaging trained on data from seven anatomical regions, substantially improving report generation, diagnosis, and visual question answering over general-purpose baselines.
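
The MoE routing behind such a design is standard: a gate scores experts, for example one per anatomical region, and each token is processed by its top-k experts. The sketch below uses illustrative sizes and a deliberately simple loop.

import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=512, n_experts=7, k=2):   # e.g., one expert per region
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                            # (B, dim) token features
        weights, idx = self.gate(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                   # route each token to top-k experts
            for j in range(self.k):
                out[b] += weights[b, j] * self.experts[int(idx[b, j])](x[b])
        return out

moe = MoELayer()
y = moe(torch.randn(3, 512))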

Authors:Xingwu Zhang, Guanxuan Li, Zhuocheng Zhang, Zijun Long
Title: RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching
Abstract:
The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase; combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes, these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training-deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at: https://github.com/longkukuhi/RoboEye.
Summary: RoboEye is a two-stage identification framework that augments 2D semantic features with selective 3D geometric keypoint matching, improving Recall@1 by 7.1% over the prior state of the art while operating on RGB images alone.
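
Keypoint-correspondence scoring can replace a single global cosine score as follows; this hedged sketch uses mutual nearest neighbors as the correspondence test, which is one common choice and not necessarily RoboEye's matcher.

import torch
import torch.nn.functional as F

def correspondence_confidence(q_desc, r_desc):
    """q_desc: (Nq, D), r_desc: (Nr, D) L2-normalized keypoint descriptors."""
    sim = q_desc @ r_desc.T                         # pairwise similarity
    fwd = sim.argmax(dim=1)                         # query -> reference
    bwd = sim.argmax(dim=0)                         # reference -> query
    mutual = bwd[fwd] == torch.arange(len(q_desc))  # mutual nearest neighbors
    return (sim.max(dim=1).values * mutual).sum() / len(q_desc)

q = F.normalize(torch.randn(100, 128), dim=-1)
r = F.normalize(torch.randn(120, 128), dim=-1)
score = correspondence_confidence(q, r)             # higher = more geometric support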

Authors:Zhuokang Shen, Kaisen Zhang, Bohan Jia, Yuan Fang, Zhou Yu, Shaohui Lin
Title: DF-LLaVA: Unlocking MLLM's potential for Synthetic Image Detection via Prompt-Guided Knowledge Injection
Abstract:
With the increasing prevalence of synthetic images, evaluating image authenticity and locating forgeries accurately while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insights into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in terms of pure authenticity classification accuracy. To address this, we propose DF-LLaVA, a simple yet effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first extracts latent knowledge from MLLMs and then injects it into training via prompts. This framework allows LLaVA to achieve outstanding detection accuracy exceeding expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of our DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.
Summary: DF-LLaVA unlocks the discrimination potential of MLLMs for synthetic image detection by extracting latent knowledge and injecting it through prompts during training, exceeding expert-model accuracy while retaining interpretability.

Authors:Shangrong Wu, Yanghong Zhou, Yang Chen, Feng Zhang, P. Y. Mok
Title: Chain-of-Thought Re-ranking for Image Retrieval Tasks
Abstract:
Image retrieval remains a fundamental yet challenging problem in computer vision. While recent advances in Multimodal Large Language Models (MLLMs) have demonstrated strong reasoning capabilities, existing methods typically employ them only for evaluation, without involving them directly in the ranking process. As a result, their rich multimodal reasoning abilities remain underutilized, leading to suboptimal performance. In this paper, we propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address this issue. Specifically, we design a listwise ranking prompt that enables the MLLM to participate directly in re-ranking candidate images. This ranking process is grounded in an image evaluation prompt, which assesses how well each candidate aligns with the user's query. By allowing the MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making, all of which are essential for accurate image retrieval. To enable structured and fine-grained analysis, we further introduce a query deconstruction prompt, which breaks down the original query into multiple semantic components. Extensive experiments on five datasets demonstrate the effectiveness of our CoTRR method, which achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR) and chat-based image retrieval (Chat-IR). Our code is available at https://github.com/freshfish15/CoTRR.
Summary: Chain-of-Thought Re-Ranking (CoTRR) lets a multimodal LLM re-rank retrieval candidates directly through listwise reasoning and query deconstruction, achieving state-of-the-art results on text-to-image, composed, and chat-based image retrieval.

Authors:Kazuma Nagata, Naoshi Kaneko
Title: DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
Abstract:
Automatic colorization of line drawings has been widely studied to reduce the labor cost of hand-drawn anime production. Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. Our method fuses low-resolution semantic features from foundation models with high-resolution spatial features from CNNs for fine-grained yet robust feature extraction. In contrast to previous methods that rely on the Multiplex Transformer and support only one or two reference images, DACoN removes this constraint, allowing any number of references. Quantitative and qualitative evaluations demonstrate the benefits of using multiple reference images, achieving superior colorization performance. Our code and model are available at https://github.com/kzmngt/DACoN.
Summary: DACoN fuses part-level semantics from foundation models with high-resolution CNN features for anime line-drawing colorization, supports any number of reference images, and achieves superior colorization performance.
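
Fusing low-resolution semantics with high-resolution spatial features amounts to upsample-and-concatenate. Dimensions and the 1x1 projection in this sketch are assumptions, not DACoN's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseFeatures(nn.Module):
    def __init__(self, sem_dim=768, cnn_dim=64, out_dim=128):
        super().__init__()
        self.proj = nn.Conv2d(sem_dim + cnn_dim, out_dim, 1)

    def forward(self, sem_feat, cnn_feat):
        # sem_feat: (B, 768, 32, 32) foundation-model tokens reshaped to a grid
        # cnn_feat: (B, 64, 256, 256) fine spatial features from the CNN
        sem_up = F.interpolate(sem_feat, size=cnn_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.proj(torch.cat([sem_up, cnn_feat], dim=1))

fuse = FuseFeatures()
out = fuse(torch.randn(1, 768, 32, 32), torch.randn(1, 64, 256, 256))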

Authors:Feng Ding, Haisheng Fu, Soroush Oraki, Jie Liang
Title: LSTC-MDA: A Unified Framework for Long-Short Term Temporal Convolution and Mixed Data Augmentation in Skeleton-Based Action Recognition
Abstract:
Skeleton-based action recognition faces two longstanding challenges: the scarcity of labeled training samples and the difficulty of modeling short- and long-range temporal dependencies. To address these issues, we propose a unified framework, LSTC-MDA, which simultaneously improves temporal modeling and data diversity. We introduce a novel Long-Short Term Temporal Convolution (LSTC) module with parallel short- and long-term branches; the two feature branches are then aligned and fused adaptively using learned similarity weights to preserve critical long-range cues lost by conventional stride-2 temporal convolutions. We also extend Joint Mixing Data Augmentation (JMDA) with an Additive Mixup at the input level, diversifying training samples and restricting mixup operations to the same camera view to avoid distribution shifts. Ablation studies confirm that each component contributes. LSTC-MDA achieves state-of-the-art results: 94.1% and 97.5% on NTU 60 (X-Sub and X-View), 90.4% and 92.0% on NTU 120 (X-Sub and X-Set), and 97.2% on NW-UCLA. Code: https://github.com/xiaobaoxia/LSTC-MDA.
Summary: LSTC-MDA pairs a Long-Short Term Temporal Convolution module with extended Joint Mixing Data Augmentation to improve temporal modeling and sample diversity in skeleton-based action recognition, achieving state-of-the-art results on NTU 60, NTU 120, and NW-UCLA.
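
The LSTC idea can be sketched as two parallel temporal convolutions whose outputs are fused by similarity weights. Kernel sizes and the cosine-similarity fusion below are stand-ins inferred from the abstract, not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTC(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.short = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.long = nn.Conv2d(channels, channels, (9, 1), padding=(4, 0))

    def forward(self, x):                        # (B, C, T, V): frames x joints
        s, l = self.short(x), self.long(x)
        # similarity-based fusion as a stand-in for the learned weights
        w_s = F.cosine_similarity(s, x, dim=1).unsqueeze(1)   # (B, 1, T, V)
        w_l = F.cosine_similarity(l, x, dim=1).unsqueeze(1)
        w = torch.softmax(torch.stack([w_s, w_l]), dim=0)
        return w[0] * s + w[1] * l               # keeps short- and long-range cues

lstc = LSTC()
y = lstc(torch.randn(2, 64, 32, 25))             # 32 frames, 25 joints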

Authors:Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
Title: GenExam: A Multidisciplinary Text-to-Image Exam
Abstract:
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to general AGI. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.
Summary: GenExam is the first multidisciplinary text-to-image exam benchmark, with 1,000 exam-style prompts across 10 subjects and fine-grained scoring points; even state-of-the-art models score below 15% under strict grading, exposing a large gap in integrated understanding, reasoning, and generation.

Authors:Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang, Qiang Zhou, Yichen Zhao, Shili Xiong, Hyeongjin Nam, Jaerin Lee, Jaeyoung Chung, JoonKyu Park, Junghun Oh, Kanggeon Lee, Wooseok Lee, Juneyoung Ro, Turghun Osman, Can Hu, Chaoyang Liao, Cheng Chen, Chengcheng Han, Chenhao Qiu, Chong Peng, Cong Xu, Dailin Li, Feiyu Wang, Feng Gao, Guibo Zhu, Guopeng Tang, Haibo Lu, Han Fang, Han Qi, Hanxiao Wu, Haobo Cheng, Hongbo Sun, Hongyao Chen, Huayong Hu, Hui Li, Jiaheng Ma, Jiang Yu, Jianing Wang, Jie Yang, Jing He, Jinglin Zhou, Jingxuan Li, Josef Kittler, Lihao Zheng, Linnan Zhao, Mengxi Jia, Muyang Yan, Nguyen Thanh Thien, Pu Luo, Qi Li, Shien Song, Shijie Dong, Shuai Shao, Shutao Li, Taofeng Xue, Tianyang Xu, Tianyi Gao, Tingting Li, Wei Zhang, Weiyang Su, Xiaodong Dong, Xiao-Jun Wu, Xiaopeng Zhou, Xin Chen, Xin Wei, Xinyi You, Xudong Kang, Xujie Zhou, Xusheng Liu, Yanan Wang, Yanbin Huang, Yang Liu, Yang Yang, Yanglin Deng, Yashu Kang, Ye Yuan, Yi Wen, Yicen Tian, Yilin Tao, Yin Tang, Yipeng Lin, Yiqing Wang, Yiting Xi, Yongkang Yu, Yumei Li, Yuxin Qin, Yuying Chen, Yuzhe Cen, Zhaofan Zou, Zhaohong Liu, Zhehao Shen, Zhenglin Du, Zhengyang Li, Zhenni Huang, Zhenwei Shao, Zhilong Song, Zhiyong Feng, Zhiyu Wang, Zhou Yu, Ziang Li, Zihan Zhai, Zijian Zhang, Ziyang Peng, Ziyun Xiao, Zongshu Li
Title: MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Abstract:
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark, helping researchers follow the state of the art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets, Lens and AdsQA, as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.
Summary: The MARS2 2025 Challenge released the Lens and AdsQA datasets and ran three competition tracks to evaluate multimodal reasoning in real-world and domain-specific scenarios, with more than 40 baselines assessed and all resources made publicly available.

Authors:Jingyi Yuan, Jianxiong Ye, Wenkang Chen, Chenqiang Gao
Title: AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration
Abstract:
Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods. The code will be available at https://github.com/Kaisor-Yuan/AD-DINOv3.
Summary: AD-DINOv3 adapts DINOv3 for zero-shot anomaly detection by pairing it with a CLIP text encoder, bridging the domain gap with lightweight adapters, and adding an Anomaly-Aware Calibration Module that steers attention toward anomalous regions, matching or surpassing state-of-the-art methods on eight benchmarks.
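
The multimodal scoring at the heart of this setup is a patch-text similarity map. The sketch below uses random stand-ins for DINOv3 patch tokens and CLIP text embeddings; shapes are illustrative.

import torch
import torch.nn.functional as F

patch_tokens = F.normalize(torch.randn(1, 1024, 768), dim=-1)  # (B, N, D) visual
text_emb = F.normalize(torch.randn(2, 768), dim=-1)            # [normal, abnormal]

sim = patch_tokens @ text_emb.T                 # (B, N, 2) patch-text similarity
anomaly = sim.softmax(dim=-1)[..., 1]           # probability of "abnormal"
anomaly_map = anomaly.view(1, 32, 32)           # reshape patches to a grid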

Authors:Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Feng Wang, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo
Title: Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
Abstract:
We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene's lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character's appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.

Authors:Harvey Mannering, Zhiwu Huang, Adam Prugel-Bennett
Title: Noise-Level Diffusion Guidance: Well Begun is Half Done
Abstract:
Diffusion models have achieved state-of-the-art image generation. However, the random Gaussian noise used to start the diffusion process influences the final output, causing variations in image quality and prompt adherence. Existing noise-level optimization approaches generally rely on extra dataset construction, additional networks, or backpropagation-based optimization, limiting their practicality. In this paper, we propose Noise Level Guidance (NLG), a simple, efficient, and general noise-level optimization approach that refines initial noise by increasing the likelihood of its alignment with general guidance, requiring no additional training data, auxiliary networks, or backpropagation. The proposed NLG approach provides a unified framework generalizable to both conditional and unconditional diffusion models, accommodating various forms of diffusion-level guidance. Extensive experiments on five standard benchmarks demonstrate that our approach enhances output generation quality and input condition adherence. By seamlessly integrating with existing guidance methods while maintaining computational efficiency, our method establishes NLG as a practical and scalable enhancement to diffusion models. Code can be found at https://github.com/harveymannering/NoiseLevelGuidance.
Summary: Noise Level Guidance (NLG) refines the initial noise of diffusion models to improve output quality and prompt adherence without extra training data, auxiliary networks, or backpropagation, and integrates seamlessly with existing guidance methods.
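
One way to act on this idea without backpropagation is candidate selection: draw several starting noises, score each against the guidance signal, and keep the best. This is a hedged reading of the abstract; the guidance_score black box and candidate count are assumptions, not the paper's algorithm.

import torch

def select_initial_noise(shape, guidance_score, n_candidates=8):
    candidates = torch.randn(n_candidates, *shape)    # candidate starting noises
    scores = torch.stack([guidance_score(z) for z in candidates])
    return candidates[scores.argmax()]                # best-aligned noise, no backprop

toy_score = lambda z: -z.abs().mean()                 # stand-in guidance signal
z0 = select_initial_noise((4, 64, 64), toy_score)     # e.g., latent for one image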

Authors:Zhen Xu, Guorui Lu, Chang Gao, Qinyu Chen
Title: EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View
Abstract:
Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide μs-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters from 11.2M to 1.2M by 89% and FLOPs per inference from 1.648G to 0.185G by 89%. It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications. The dataset and code are available at https://github.com/zen5x5/EvHand-FPV.
Summary: EvHand-FPV is a lightweight framework for egocentric 3D hand tracking from a single event camera, raising 2D-AUCp from 0.77 to 0.85 while cutting parameters and FLOPs by 89%, making it suitable for on-device XR applications.

Authors:Jovana Videnovic, Matej Kristan, Alan Lukezic
Title: Distractor-Aware Memory-Based Visual Object Tracking
Abstract:
Recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adapted to visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and an introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into the real-time tracker EfficientTAM leads to an 11% improvement and matches the tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with the edge-based tracker EdgeTAM delivers a 4% performance boost, demonstrating very good generalization across architectures.
中文: 提出的DAM4SAM模块通过引入干扰物感知记忆系统改进了SAM2,有效减少跟踪漂移并提升目标遮挡后的重检测能力,在多个基准测试中创下最新最优成绩,且与其他跟踪器集成时展现出优秀的泛化性能。
English: The proposed DAM4SAM module enhances SAM2 by incorporating a distractor-aware memory system that mitigates tracking drift and improves redetection after occlusion, achieving state-of-the-art results across multiple benchmarks while demonstrating strong generalization when integrated with other trackers.
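As a rough illustration of the idea, the sketch below keeps separate target and distractor memories and gates updates with an introspection score; the class name, thresholds, and cosine scoring are illustrative assumptions, not the actual DAM4SAM module, which operates inside SAM2's memory attention.
```python
# Minimal sketch, assuming per-frame feature vectors and a scalar introspection score.
import numpy as np

class DistractorAwareMemory:
    def __init__(self, capacity=8, conf_thresh=0.8):
        self.capacity = capacity          # max stored exemplars per memory
        self.conf_thresh = conf_thresh    # introspection gate
        self.target_mem = []              # frames judged to show the target
        self.distractor_mem = []          # frames judged to show a distractor

    def update(self, frame_feat, introspection_score):
        """Store the frame as target evidence only when introspection is confident;
        otherwise keep it as a distractor exemplar used for suppression."""
        if introspection_score >= self.conf_thresh:
            self.target_mem.append(frame_feat)
            self.target_mem = self.target_mem[-self.capacity:]
        else:
            self.distractor_mem.append(frame_feat)
            self.distractor_mem = self.distractor_mem[-self.capacity:]

    def score_candidate(self, cand_feat):
        """Cosine similarity to target memory minus similarity to distractor memory."""
        def max_sim(mem):
            if not mem:
                return 0.0
            sims = [np.dot(cand_feat, m) /
                    (np.linalg.norm(cand_feat) * np.linalg.norm(m) + 1e-8) for m in mem]
            return max(sims)
        return max_sim(self.target_mem) - max_sim(self.distractor_mem)
```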

Authors:Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang
Title: EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics
Abstract:
Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Ultimately, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at: https://github.com/einsteinxia/EDITS.
中文摘要:EDITS框架通过融合视觉语言模型生成的文本语义与图像特征,并利用大型语言模型构建文本原型,有效提升了数据集蒸馏过程中对高级语义信息的保留能力,显著优于传统仅关注低级特征的方法。
English Summary: The EDITS framework enhances dataset distillation by integrating textual semantics from vision-language models and large language models to synthesize compact datasets that preserve high-level semantic information, outperforming traditional methods focused on low-level features.

Authors:Puru Vaish, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink
Title: Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation
Abstract:
Many recent approaches in representation learning implicitly assume that uncorrelated views of a data point are sufficient to learn meaningful representations for various downstream tasks. In this work, we challenge this assumption and demonstrate that meaningful structure in the latent space does not emerge naturally. Instead, it must be explicitly induced. We propose a method that aligns representations from different views of the data to capture complementary information without inducing false positives. Our experiments show that our proposed self-supervised learning method, Consistent View Alignment, improves performance for downstream tasks, highlighting the critical role of structured view alignment in learning effective representations. Our method achieved first and second place in the MICCAI 2025 SSL3D challenge when using a Primus vision transformer and ResEnc convolutional neural network, respectively. The code and pretrained model weights are released at https://github.com/Tenbatsu24/LatentCampus.
中文摘要:本研究挑战了不相关数据视图足以学习有效表征的假设,提出名为“一致视图对齐”的自监督方法,通过显式构建潜在空间结构提升下游任务性能,并在MICCAI 2025挑战赛中取得领先排名。
English Summary: The study challenges the assumption that uncorrelated data views suffice for learning meaningful representations, proposing a self-supervised method called Consistent View Alignment that explicitly structures latent space to improve downstream task performance, as evidenced by top rankings in the MICCAI 2025 challenge.
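To make the notion of view alignment concrete, here is a minimal sketch of a cosine-alignment objective between paired view embeddings; the function name and the plain cosine loss are assumptions, and the paper's Consistent View Alignment includes additional machinery to avoid false positives that is not reproduced here.
```python
# Minimal sketch, assuming z1 and z2 are embeddings of two views of the same volume.
import torch
import torch.nn.functional as F

def view_alignment_loss(z1, z2):
    """Pull representations of corresponding views together via cosine alignment."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    return (1.0 - (z1 * z2).sum(dim=-1)).mean()

# toy usage
z1, z2 = torch.randn(4, 128), torch.randn(4, 128)
loss = view_alignment_loss(z1, z2)
```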

Authors:Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Thien Nguyen, Daisuke Kihara, Tianyang Wang, Xingjian Li, Min Xu
Title: Semi-MoE: Mixture-of-Experts meets Semi-Supervised Histopathology Segmentation
Abstract:
Semi-supervised learning has been employed to alleviate the need for extensive labeled data for histopathology image segmentation, but existing methods struggle with noisy pseudo-labels due to ambiguous gland boundaries and morphological misclassification. This paper introduces Semi-MoE, to the best of our knowledge, the first multi-task Mixture-of-Experts framework for semi-supervised histopathology image segmentation. Our approach leverages three specialized expert networks: a main segmentation expert, a signed distance field regression expert, and a boundary prediction expert, each dedicated to capturing distinct morphological features. Subsequently, the Multi-Gating Pseudo-labeling module dynamically aggregates expert features, enabling a robust fuse-and-refine pseudo-labeling mechanism. Furthermore, to eliminate manual tuning while dynamically balancing multiple learning objectives, we propose an Adaptive Multi-Objective Loss. Extensive experiments on GlaS and CRAG benchmarks show that our method outperforms state-of-the-art approaches in low-label settings, highlighting the potential of MoE-based architectures in advancing semi-supervised segmentation. Our code is available at https://github.com/vnlvi2k3/Semi-MoE.
中文: 本文提出Semi-MOE框架,首次采用多任务专家混合模型解决组织病理学图像半监督分割中的伪标签噪声问题,通过专业化网络和自适应损失机制在基准测试中实现最优性能。
English: This paper introduces Semi-MoE, a novel multi-task Mixture-of-Experts framework that addresses noisy pseudo-labels in semi-supervised histopathology image segmentation through specialized expert networks and an adaptive loss mechanism, demonstrating superior performance on benchmark datasets.
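As one plausible reading of an adaptive multi-objective loss, the sketch below uses Kendall-style learnable log-variances to balance the three expert objectives; the class name and weighting scheme are assumptions and may differ from the paper's Adaptive Multi-Objective Loss.
```python
# Hedged sketch, assuming uncertainty-style weighting over the three task losses.
import torch
import torch.nn as nn

class AdaptiveMultiLoss(nn.Module):
    def __init__(self, num_tasks=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # one per expert objective

    def forward(self, losses):
        # losses: list of scalar task losses (segmentation, SDF regression, boundary)
        total = 0.0
        for i, l in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * l + self.log_vars[i]
        return total

# toy usage
aml = AdaptiveMultiLoss(3)
total = aml([torch.tensor(0.7), torch.tensor(0.3), torch.tensor(0.5)])
```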

Authors:Jiayu Yuan, Ming Dai, Enhui Zheng, Chao Su, Nanxing Chen, Qiming Hu, Shibo Zhu, Yibin Cao
Title: SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments
Abstract:
Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. However, existing retrieval-based approaches face limitations in dataset availability and persistent challenges including suboptimal real-time performance, environmental sensitivity, and limited generalization capability, particularly in dynamic or temporally varying environments. To overcome these limitations, we present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. This approach integrates robust semantic features from both UAV-captured images and satellite imagery through two key innovations: a semantic weighting mechanism and an optimized particle filtering architecture. Evaluated using our dataset, the proposed method achieves 10x computational efficiency gain over feature extraction methods, maintains global positioning errors below 10 meters, and enables rapid 4 degree of freedom (4-DoF) pose estimation within seconds using accessible low-resolution satellite maps. Code and dataset will be available at https://github.com/YuanJiayuuu/SWA-PF.
中文总结:本研究提出大规模多高度飞行段数据集和新型语义加权自适应粒子滤波方法,在无GNSS环境下实现10倍计算效率提升、10米内定位精度及快速4自由度姿态估计。
English Summary: This study introduces a large-scale Multi-Altitude Flight Segments dataset and a novel Semantic-Weighted Adaptive Particle Filter method that achieves 10x computational efficiency, sub-10-meter positioning accuracy, and rapid 4-DoF pose estimation in GNSS-denied environments.
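The sketch below shows a generic 4-DoF particle-filter step with semantic reweighting; the `semantic_score` callback, noise magnitudes, and resampling rule are illustrative assumptions rather than the paper's exact SWA-PF formulation.
```python
# Minimal sketch, assuming semantic_score(pose) compares the UAV semantic map with the
# satellite map rendered at a candidate [x, y, altitude, yaw] pose.
import numpy as np

def pf_step(particles, weights, semantic_score, motion_noise=(1.0, 1.0, 0.5, 0.05)):
    N = particles.shape[0]
    # 1) diffuse particles with motion noise
    particles = particles + np.random.randn(N, 4) * np.asarray(motion_noise)
    # 2) reweight each particle by semantic agreement with the satellite map
    weights = weights * np.array([semantic_score(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # 3) systematic resampling when the effective sample size collapses
    if 1.0 / (weights ** 2).sum() < N / 2:
        idx = np.random.choice(N, size=N, p=weights)
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    return particles, weights

# toy usage with a synthetic score that prefers poses near the origin
parts = np.random.randn(500, 4)
w = np.full(500, 1.0 / 500)
parts, w = pf_step(parts, w, semantic_score=lambda p: np.exp(-np.linalg.norm(p[:2])))
```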

Authors:Huichun Liu, Xiaosong Li, Yang Liu, Xiaoqi Cheng, Haishu Tan
Title: NDLPNet: A Location-Aware Nighttime Deraining Network and a Real-World Benchmark Dataset
Abstract:
Visual degradation caused by rain streak artifacts in low-light conditions significantly hampers the performance of nighttime surveillance and autonomous navigation. Existing image deraining techniques are primarily designed for daytime conditions and perform poorly under nighttime illumination due to the spatial heterogeneity of rain distribution and the impact of light-dependent streak visibility. In this paper, we propose a novel Nighttime Deraining Location-enhanced Perceptual Network (NDLPNet) that effectively captures the spatial positional information and density distribution of rain streaks in low-light environments. Specifically, we introduce a Position Perception Module (PPM) to capture and leverage spatial contextual information from input data, enhancing the model's capability to identify and recalibrate the importance of different feature channels. The proposed nighttime deraining network effectively removes rain streaks while preserving crucial background information. Furthermore, we construct a night scene rainy (NSR) dataset comprising 900 image pairs, all based on real-world nighttime scenes, providing a new benchmark for nighttime deraining research. Extensive qualitative and quantitative evaluations on both existing datasets and the NSR dataset consistently demonstrate that our method outperforms state-of-the-art (SOTA) methods in nighttime deraining tasks. The source code and dataset are available at https://github.com/Feecuin/NDLPNet.
中文: 本文提出的夜间去雨定位增强感知网络(NDLPNet)能有效消除弱光环境下的雨纹并保留背景信息,通过在真实夜间场景数据集上的验证表明其性能优于现有最优方法。
English: The proposed Nighttime Deraining Location-enhanced Perceptual Network (NDLPNet) effectively removes rain streaks while preserving background details in low-light conditions, outperforming existing methods as validated on a new real-world nighttime dataset.

Authors:Jinwoo Jeon, JunHyeok Oh, Hayeong Lee, Byung-Jun Lee
Title: Iterative Prompt Refinement for Safer Text-to-Image Generation
Abstract:
Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.
中文: 本文提出一种迭代式提示优化算法,通过视觉语言模型综合分析文本提示与生成图像,在提升安全性的同时保持用户意图,并提供了用于监督微调的新型数据集。
English: This paper introduces an iterative prompt refinement algorithm that uses Vision Language Models to analyze both text prompts and generated images, enhancing safety without compromising user intent while offering a new dataset for supervised fine-tuning.
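A minimal sketch of such a refinement loop follows; `generate_image`, `vlm_assess`, and `vlm_refine` are placeholder callables, and the stopping criterion is an assumption rather than the paper's algorithm.
```python
# Illustrative loop only: a T2I model and a fine-tuned VLM are passed in as callables.
def refine_prompt(prompt, generate_image, vlm_assess, vlm_refine, max_rounds=3):
    for _ in range(max_rounds):
        image = generate_image(prompt)
        verdict = vlm_assess(prompt, image)   # e.g. {"safe": bool, "feedback": str}
        if verdict["safe"]:
            return prompt, image              # already safe: avoid needless edits
        prompt = vlm_refine(prompt, verdict["feedback"])
    return prompt, generate_image(prompt)
```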

Authors:Hao Yin, Xin Man, Feiyu Chen, Jie Shao, Heng Tao Shen
Title: Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval
Abstract:
Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning -- hence the term ``full-mode" -- without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.
中文:FMFA框架通过结合显式细粒度对齐和隐式关系推理,无需额外监督即可实现更优的跨模态匹配,从而提升文本到图像的人物检索效果。
English: The proposed FMFA framework enhances text-to-image person retrieval by combining explicit fine-grained alignment and implicit relational reasoning to achieve superior cross-modal matching without extra supervision.

Authors:Ming Dai, Wenxuan Cheng, Jiang-Jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang
Title: Improving Generalized Visual Grounding with Instance-aware Joint Learning
Abstract:
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistency predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.
中文:InstanceVG是一种创新的多任务框架,通过引入实例感知能力统一处理广义指代表达式理解与分割任务,借助实例查询实现框与掩码的一致性预测,在多项评估中达到最优性能。
English: InstanceVG is a novel multi-task framework that jointly addresses Generalized Referring Expression Comprehension and Segmentation by incorporating instance-aware capabilities, achieving state-of-the-art performance through unified predictions of boxes and masks.

Authors:Zirun Guo, Feng Zhang, Kai Jia, Tao Jin
Title: LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Abstract:
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
中文: LLM-Interleaved (LLM-I) 是一个将交错式图文生成重构为工具使用问题的灵活框架,通过强化学习训练智能体协调多种视觉工具,在多个基准测试中大幅超越现有方法。
English: LLM-Interleaved (LLM-I) is a dynamic framework that transforms interleaved image-text generation into a tool-use problem, enabling an LLM agent to intelligently orchestrate specialized visual tools and achieve state-of-the-art performance across multiple benchmarks.

Authors:Amir-Hossein Shahidzadeh, Jiyue Zhu, Kezhou Chen, Sha Yi, Cornelia Fermüller, Yiannis Aloimonos, Xiaolong Wang
Title: Object Pose Estimation through Dexterous Touch
Abstract:
Robust object pose estimation is essential for manipulation and interaction tasks in robotics, particularly in scenarios where visual data is limited or sensitive to lighting, occlusions, and appearances. Tactile sensors often offer limited and local contact information, making it challenging to reconstruct the pose from partial data. Our approach uses sensorimotor exploration to actively control a robot hand to interact with the object. We train with Reinforcement Learning (RL) to explore and collect tactile data. The collected 3D point clouds are used to iteratively refine the object's shape and pose. In our setup, one hand holds the object steady while the other performs active exploration. We show that our method can actively explore an object's surface to identify critical pose features without prior knowledge of the object's geometry. Supplementary material and more demonstrations will be provided at https://amirshahid.github.io/BimanualTactilePose .

Authors:Jiangbei Yue, Shuonan Yang, Tailin Chen, Jianbo Jiao, Zeyu Fu
Title: Multimodal Hate Detection Using Dual-Stream Graph Neural Networks
Abstract:
Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.
中文: 该研究提出的多模态双流图神经网络模型通过构建实例图和权重图来突出仇恨内容并捕捉结构化关系,有效检测仇恨视频,实现了最先进的分类性能和强可解释性。
English: The proposed multimodal dual-stream graph neural network model effectively detects hateful videos by constructing instance and weight graphs to emphasize hateful content and capture structured relationships, achieving state-of-the-art performance and explainability.

Authors:Uriel Garcilazo-Cruz, Joseph O. Okeme, Rodrigo A. Vargas--Hernández
Title: LivePyxel: Accelerating image annotations with a Python-integrated webcam live streaming
Abstract:
The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where real-time data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePyxel, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable real-time image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software's capacity to work with non-destructive layers that enable high-performance editing. LivePyxel also offers wide compatibility across video devices, and it is optimized for object detection operations via OpenCV in combination with NumPy for high-performance matrix and linear algebra operations. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel is freely available at https://github.com/UGarCil/LivePyxel
中文: LivePixel 是一款基于 Python 的工具,可直接从显微镜等设备进行实时图像标注,通过整合贝塞尔曲线和二进制掩码等灵活工具,解决了传统软件的局限性,从而加速科学工作流程中人工智能模型的开发。
English: LivePyxel is a Python-based tool that enables real-time image annotation directly from devices like microscopes, addressing the limitations of traditional software by integrating flexible tools such as Bézier splines and binary masks to streamline AI model development in scientific workflows.
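For context, a bare-bones OpenCV + NumPy capture loop of the kind such a tool builds on is sketched below; the key bindings and file naming are assumptions, and the real LivePyxel adds Bézier splines, non-destructive mask layers, and a full GUI.
```python
# Minimal live-capture sketch: grab frames from a video device and snapshot
# frame/mask pairs for later annotation.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)            # webcam or microscope exposed as a video device
frame_id = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("live", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):              # snapshot the current frame for annotation
        mask = np.zeros(frame.shape[:2], dtype=np.uint8)   # empty binary mask layer
        cv2.imwrite(f"frame_{frame_id:04d}.png", frame)
        cv2.imwrite(f"mask_{frame_id:04d}.png", mask)
        frame_id += 1
    elif key == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```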

Authors:Hao Xu, Xiaolin Wu, Xi Zhang
Title: Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization
Abstract:
3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
中文: 3D高斯泼溅(3DGS)虽能实现逼真渲染但数据量庞大需压缩,采用场景自适应格型矢量量化(SALVQ)替代均匀标量量化,能以极低开销提升压缩性能并实现动态码率调节。
English: 3D Gaussian Splatting (3DGS) achieves photorealistic rendering but requires compression for cost efficiency, and replacing uniform scalar quantization with scene-adaptive lattice vector quantization (SALVQ) enhances compression performance with minimal overhead while enabling dynamic bit rate adjustment.
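A minimal sketch of lattice vector quantization with a scalable basis is given below; the Babai-style rounding, the `scale` knob, and the function name are assumptions, and the paper's scene-adaptive basis optimization and integration into anchor-based 3DGS codecs are not shown.
```python
# Hedged sketch, assuming a square, invertible lattice basis B.
import numpy as np

def lattice_quantize(x, B, scale=1.0):
    """Quantize x (d,) to the lattice {B @ k : k integer}; `scale` trades rate for
    distortion by densifying or coarsening the lattice."""
    Bs = B * scale
    k = np.rint(np.linalg.solve(Bs, x))   # nearest integer coordinates (Babai rounding)
    return Bs @ k, k                       # reconstruction and integer symbols to entropy-code

# uniform scalar quantization is the special case B = step * identity
x = np.random.randn(4)
x_hat, symbols = lattice_quantize(x, np.eye(4), scale=0.5)
```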

Authors:Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
Title: EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing
Abstract:
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images -- resulting in limited coverage and inheriting biases from prior generative models -- or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline's modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.
Chinese: 本文提出了EdiVal-Agent,一个自动化、可扩展的评估框架,通过结合视觉语言模型与物体检测器,对基于指令的图像编辑进行细粒度评估,解决了当前评估方法的局限性,并显示出与人类判断更好的一致性。
English: This paper introduces EdiVal-Agent, an automated and scalable evaluation framework that integrates vision-language models with object detectors to provide fine-grained assessment of instruction-based image editing, addressing limitations in current evaluation methods and demonstrating improved alignment with human judgments.

Authors:Hugo Carlesso, Josiane Mothe, Radu Tudor Ionescu
Title: Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
Abstract:
Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at https://github.com/hugocarlesso/CMTSSL.
中文: 高光谱成像需要紧凑模型以支持星载高效处理,新提出的课程多任务自监督学习框架通过整合空间与光谱推理的轻量化设计,在模型比现有技术轻16,000倍的情况下仍保持优异性能。
English: Hyperspectral imaging requires compact models for efficient onboard satellite processing, which is addressed by the novel curriculum multi-task self-supervised learning framework that integrates spatial and spectral reasoning in a lightweight design, achieving strong performance with models over 16,000 times lighter than existing ones.

Authors:Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma, Wei Wei, Shaohua Kevin Zhou
Title: More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
Abstract:
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this "silver-standard" dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
中文: 大型语言模型能够自动从放射报告中提取诊断标签,以极低成本创建大规模医疗数据集,在医学人工智能系统的视觉语言预训练中实现了最先进的性能。
English: Large Language Models enable cost-effective creation of large-scale medical datasets by automatically extracting diagnostic labels from radiology reports, achieving state-of-the-art performance in vision-language pre-training for medical AI systems.
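A hedged sketch of silver-standard label extraction is shown below; `call_llm`, the finding list, and the JSON prompt are placeholders, not the paper's actual prompt or label set.
```python
# Illustrative sketch: prompt any chat-completion API (wrapped as call_llm) to return
# binary diagnostic labels for a radiology report.
import json

FINDINGS = ["atelectasis", "cardiomegaly", "pleural effusion", "pneumothorax"]

def extract_labels(report_text, call_llm):
    prompt = (
        "Read the radiology report and answer with a JSON object mapping each "
        f"finding in {FINDINGS} to true or false.\n\nReport:\n{report_text}"
    )
    raw = call_llm(prompt)
    try:
        labels = json.loads(raw)
    except json.JSONDecodeError:
        labels = {f: False for f in FINDINGS}   # fall back to all-negative on parse failure
    return {f: bool(labels.get(f, False)) for f in FINDINGS}
```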

Authors:Ruifei Ding, Zhe Chen, Wen Fan, Chen Long, Huijuan Xiao, Yelu Zeng, Zhen Dong, Bisheng Yang
Title: WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory
Abstract:
Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scene, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging the unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks--tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future works for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Model for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.
中文: WHU-STree数据集通过提供跨城市、多模态且标注丰富的街道树木数据,解决了现有数据集的局限性,支持10余项任务,并在树种分类和单木分割中展现了数据融合的实际应用潜力。
English: The WHU-STree dataset addresses limitations in urban street tree inventories by providing a cross-city, multi-modal collection with rich annotations, supporting over 10 tasks and demonstrating the potential of data fusion for practical algorithm deployment in tree species classification and segmentation.

Authors:Zhihao Zhang, Chunyu Lin, Lang Nie, Jiyuan Wang, Yao Zhao
Title: Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised Baseline
Abstract:
As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird's-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., light, occlusion, etc.). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme. The relevant source codes and the dataset have been made publicly available at https://github.com/zzh362/CRPS-D.
中文: 本研究提出了大规模停车位检测数据集CRPS-D以解决现有数据集在真实场景覆盖上的不足,并开发了半监督方法SS-PSD,通过利用未标注数据显著提升了检测性能,超越了现有最优方案。
English: This study introduces CRPS-D, a large-scale parking slot detection dataset addressing limitations of existing datasets by including diverse real-world conditions, and proposes SS-PSD, a semi-supervised method that outperforms state-of-the-art solutions by leveraging unlabeled data.

Authors:Yan Xingyang, Huang Xiaohong, Zhang Zhao, You Tian, Xu Ziheng
Title: Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement
Abstract:
In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. Traditional Fourier-domain frequency fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes the divergence between predicted and reference spectra using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the VGG-based perceptual loss by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: https://github.com/YanXY000/LLFDisc
中文摘要:本文提出LLFDisc,一种采用新型分布感知损失和增强感知损失的U形深度增强网络,通过鲁棒对齐傅里叶域信息并提升结构保真度,实现了最先进的性能表现。
English Summary: This paper introduces LLFDisc, a U-shaped deep enhancement network with a novel distribution-aware loss and enhanced perceptual loss, which achieves state-of-the-art performance by robustly aligning Fourier-domain information and improving structural fidelity.
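One plausible reading of a distribution-aware Fourier loss is sketched below: amplitude spectra are normalized into distributions over frequencies and compared with KL divergence; the exact parameterization of the paper's closed-form objective may differ.
```python
# Hedged sketch, assuming amplitude spectra normalized into per-image distributions.
import torch
import torch.nn.functional as F

def fourier_kl_loss(pred, target, eps=1e-8):
    # pred, target: (B, C, H, W) images
    amp_p = torch.fft.fft2(pred).abs().flatten(1)
    amp_t = torch.fft.fft2(target).abs().flatten(1)
    p = amp_p / (amp_p.sum(dim=1, keepdim=True) + eps)   # normalized amplitude distribution
    q = amp_t / (amp_t.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((p + eps).log(), q, reduction="batchmean")

# toy usage
loss = fourier_kl_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```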

Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang
Title: Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
Abstract:
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
中文: 本文提出了一种双阶段重加权专家混合框架,通过融合多模型特征和专用分类器,有效检测第一人称视频中细微且罕见的用户错误行为。
English: This paper introduces a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework that effectively detects subtle and infrequent user errors in egocentric videos by combining multi-model features and specialized classifiers.
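As a small illustration of the first classifier branch, the sketch below implements inverse-frequency reweighted cross-entropy; the weighting rule is an assumption, and the AUC loss and label-aware loss with sharpness-aware minimization are not shown.
```python
# Minimal sketch, assuming inverse-frequency class weights for the imbalanced labels.
import torch
import torch.nn.functional as F

def reweighted_ce(logits, targets, class_counts):
    weights = 1.0 / torch.as_tensor(class_counts, dtype=torch.float)   # rarer class, larger weight
    weights = weights / weights.sum() * len(class_counts)
    return F.cross_entropy(logits, targets, weight=weights)

# toy usage with a 9:1 imbalance
loss = reweighted_ce(torch.randn(8, 2), torch.randint(0, 2, (8,)), class_counts=[900, 100])
```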

Authors:Hojat Ardi, Amir Jahanshahi, Ali Diba
Title: T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking
Abstract:
Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: https://github.com/to/be/released
Chinese: T-SiamTPN通过在SiamTPN架构中引入时序建模,显著提升了空中目标跟踪的鲁棒性和精度,同时保持实时计算效率,适用于嵌入式应用场景。
English: T-SiamTPN enhances aerial object tracking by integrating temporal modeling into the SiamTPN framework, improving robustness and achieving state-of-the-art performance while maintaining real-time efficiency on embedded devices.

Authors:Weiming Chen, Zhihan Zhu, Yijia Wang, Zhihai He
Title: Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
Abstract:
Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
Chinese: 整流流模型面临反转精度低和多模态注意力纠缠的挑战,通过基于龙格-库塔求解器的高阶反转方法和解耦扩散变换器注意力机制,在保真度和可编辑性方面实现了最优性能。
English: Rectified flow models face challenges with inversion accuracy and entangled multimodal attention, which are addressed through a high-order inversion method using the Runge-Kutta solver and a Decoupled Diffusion Transformer Attention mechanism, achieving state-of-the-art performance in fidelity and editability.
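For intuition, a classical RK4 step for integrating the rectified-flow ODE dx/dt = v(x, t) from the image back toward noise is sketched below; the solver order, time parameterization, and `velocity` wrapper are assumptions that may differ from the paper's inversion method.
```python
# Classical fourth-order Runge-Kutta step; velocity(x, t) wraps the trained flow model.
import torch

def rk4_inversion_step(x, t, dt, velocity):
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = velocity(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = velocity(x + dt * k3, t + dt)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# toy usage with a linear velocity field; dt is negative when stepping toward noise
x = torch.randn(1, 4)
x_next = rk4_inversion_step(x, t=0.0, dt=-0.1, velocity=lambda x, t: -x)
```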

Authors:Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, Guoquan Zhang
Title: Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder
Abstract:
Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.
Chinese: Lego-Edit利用多模态大语言模型组织编辑工具包,并通过渐进式强化学习训练,在开放领域指令上展现出强大的推理能力,无需额外微调即可使用新工具,在GEdit-Bench和ImgBench上实现了最先进的性能。
English: Lego-Edit utilizes a Multi-modal Large Language Model to orchestrate a toolkit of editing models and employs progressive reinforcement learning, achieving state-of-the-art performance by generalizing effectively to diverse real-world instructions without requiring additional fine-tuning for new tools.

Authors:Yabo Zhang, Yihan Zeng, Qingyun Li, Zhen Hu, Kavin Han, Wangmeng Zuo
Title: Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about a 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
Chinese: Tool-R1 是一个强化学习框架,通过生成可执行的 Python 代码和集成用户定义工具,提升大语言模型处理复杂多步骤任务的能力,在 GAIA 基准测试中显著提高了准确性和鲁棒性。
English: Tool-R1 is a reinforcement learning framework that enhances large language models' ability to perform complex, multi-step tasks using executable Python code and integrated tools, significantly improving accuracy and robustness on benchmarks like GAIA.
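A toy version of such an outcome-based reward is sketched below; `judge_answer` is a placeholder LLM-as-judge call and the weights are illustrative, not the paper's reward formula.
```python
# Hedged sketch: combine code-execution success with an LLM-judged answer score.
def tool_r1_reward(code_ran_ok, answer, reference, judge_answer,
                   w_exec=0.3, w_answer=0.7):
    exec_reward = 1.0 if code_ran_ok else 0.0
    answer_reward = float(judge_answer(answer, reference))   # judge returns a score in [0, 1]
    return w_exec * exec_reward + w_answer * answer_reward

# toy usage with an exact-match "judge"
r = tool_r1_reward(True, "42", "42", judge_answer=lambda a, b: a == b)
```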

Authors:Julien Walther, Rémi Giraud, Michaël Clément
Title: Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation
Abstract:
Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: https://github.com/waldo-j/spam.
中文摘要:SPAM是一个多功能框架,通过结合图像特征和大规模预训练模型,生成既准确又规则化的超像素,在分割任务中超越了现有方法。
English Summary: SPAM is a versatile framework that generates accurate and regular superpixels by integrating image features with large-scale pretrained models, outperforming existing methods in segmentation tasks.

Authors:Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu
Title: Contextualized Representation Learning for Effective Human-Object Interaction Detection
Abstract:
Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning framework that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures. This enables our model to identify tool-dependent interactions such as 'filling'. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.
中文: 本文提出了一种情境化表征学习方法,通过结合功能引导推理和情境提示与视觉线索来增强人物-物体交互检测,在标准基准测试中取得了优越性能。
English: This paper introduces a Contextualized Representation Learning method that enhances Human-Object Interaction detection by incorporating affordance-guided reasoning and contextual prompts with visual cues, achieving superior performance on standard benchmarks.

Authors:Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang
Title: RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation
Abstract:
Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
中文摘要:作者提出RIS-FUSION框架,通过联合优化将参照图像分割融入文本驱动的红外-可见光图像融合,在新建的大规模数据集上实现最佳性能,mIoU指标提升超过11%。
English Summary: The authors propose RIS-FUSION, a unified framework that enhances text-driven infrared-visible image fusion by incorporating referring image segmentation through joint optimization, achieving state-of-the-art performance with an 11% mIoU improvement.
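A hedged sketch of a language-gated fusion block is given below; the per-channel sigmoid gating, layer sizes, and class name are assumptions and likely simpler than the actual LangGatedFusion module.
```python
# Minimal sketch, assuming a text embedding produces channel gates over fused IR/visible features.
import torch
import torch.nn as nn

class LangGatedFusion(nn.Module):
    def __init__(self, img_channels=64, text_dim=512):
        super().__init__()
        self.fuse = nn.Conv2d(2 * img_channels, img_channels, kernel_size=3, padding=1)
        self.gate = nn.Sequential(nn.Linear(text_dim, img_channels), nn.Sigmoid())

    def forward(self, feat_ir, feat_vis, text_emb):
        fused = self.fuse(torch.cat([feat_ir, feat_vis], dim=1))
        g = self.gate(text_emb).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel gates
        return fused * g                                       # emphasize text-referred content

# toy usage
m = LangGatedFusion()
out = m(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32), torch.randn(1, 512))
```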

Authors:Wenzhuo Jin, Qianfeng Yang, Xianhao Wu, Hongming Chen, Pengpeng Li, Xiang Chen
Title: SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes
Abstract:
Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scene setups and smoke concentrations. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from https://github.com/ncfjd/SmokeBench.
中文摘要:火灾初期烟雾会严重遮挡监控视野,为此我们开发了SmokeBench真实场景数据集,提供成对的无烟与有烟图像,以推动图像去烟算法发展,提升应急救援效能。
English Summary: Early-stage fires produce smoke that obscures surveillance visibility, prompting the creation of SmokeBench—a real-world dataset of paired smoke-free and smoke-degraded images to advance desmoking algorithms for emergency response.

Authors:Xianda Guo, Chenming Zhang, Ruilin Wang, Youmin Zhang, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Qin Zou, Long Chen
Title: StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo
Abstract:
Stereo matching plays a crucial role in enabling depth perception for autonomous driving and robotics. While recent years have witnessed remarkable progress in stereo matching algorithms, largely driven by learning-based methods and synthetic datasets, the generalization performance of these models remains constrained by the limited diversity of existing training data. To address these challenges, we present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements, as well as varied environmental conditions such as lighting changes, weather effects, and road geometries. We conduct comprehensive cross-domain experiments across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) and demonstrate that models trained on StereoCarla outperform those trained on 11 existing stereo datasets in terms of generalization accuracy across multiple benchmarks. Furthermore, when integrated into multi-dataset training, StereoCarla contributes substantial improvements to generalization accuracy, highlighting its compatibility and scalability. This dataset provides a valuable benchmark for developing and evaluating stereo algorithms under realistic, diverse, and controllable settings, facilitating more robust depth perception systems for autonomous vehicles. Code is available at https://github.com/XiandaGuo/OpenStereo, and data is available at https://xiandaguo.net/StereoCarla.
中文: StereoCarla是基于CARLA模拟器开发的高保真合成立体数据集,通过整合多样化的相机配置和环境条件,显著提升了立体匹配算法在自动驾驶中的泛化能力,并在多个基准测试中表现出卓越性能。
English: StereoCarla is a high-fidelity synthetic stereo dataset developed using the CARLA simulator to enhance the generalization of stereo matching algorithms in autonomous driving by incorporating diverse camera configurations and environmental conditions, demonstrating superior performance across multiple benchmarks.

Authors:Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong
Title: Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Abstract:
The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate that our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
Summary: This research introduces a novel framework for detecting semantically-coordinated multimodal manipulations by creating the first Semantic-Aligned Multimodal Manipulation dataset and developing a retrieval-augmented detection system that significantly outperforms existing methods.

Authors:Liming Lu, Shuchao Pang, Xu Zheng, Xiang Gu, Anan Du, Yunhuai Liu, Yongbin Zhou
Title: CIARD: Cyclic Iterative Adversarial Robustness Distillation
Abstract:
Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from a teacher model to a lightweight student model, enabling resilient performance in resource-constrained scenarios. Though existing ARD approaches enhance the student model's robustness, an inevitable by-product is degraded performance on clean examples. We summarize the causes of this problem inherent in existing methods with a dual-teacher framework as: 1. The divergent optimization objectives of the dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and 2. The iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: a. A multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and b. Continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance, with an average 3.53-point improvement in adversarial defense rates across various attack scenarios and a 5.87-point increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/eminentgu/CIARD
Summary: The proposed CIARD method introduces a multi-teacher framework with contrastive alignment and continuous retraining to simultaneously enhance adversarial robustness and clean sample accuracy in lightweight models, achieving significant improvements across multiple datasets.

Authors:Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang
Title: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
Abstract:
We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning those that do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms various VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate LightVLA*, a learnable query-based token pruning variant with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as a VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the joint goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
Summary: LightVLA is a differentiable token pruning framework that enhances vision-language-action models by adaptively pruning non-essential visual tokens, achieving higher task success rates with significantly reduced computational costs.
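The selection mechanism lends itself to a compact illustration: dynamic queries score the visual tokens, and a straight-through Gumbel softmax makes the hard selection differentiable. The sketch below is a minimal rendering of that idea with made-up shapes, not the authors' implementation.
```python
# Minimal sketch of query-driven differentiable token selection with a
# straight-through Gumbel softmax (illustrative shapes, hypothetical names).
import torch
import torch.nn.functional as F

def select_tokens(tokens, queries, tau=1.0):
    """tokens: (B, N, D) visual tokens; queries: (B, Q, D) dynamic queries.
    Returns (B, Q, D): one token kept per query, selected differentiably."""
    scores = torch.einsum("bqd,bnd->bqn", queries, tokens)  # importance logits
    # hard=True gives a one-hot selection in the forward pass while
    # gradients flow through the soft probabilities in the backward pass
    onehot = F.gumbel_softmax(scores, tau=tau, hard=True, dim=-1)  # (B, Q, N)
    return torch.einsum("bqn,bnd->bqd", onehot, tokens)

B, N, Q, D = 2, 256, 64, 32
kept = select_tokens(torch.randn(B, N, D), torch.randn(B, Q, D))
print(kept.shape)  # torch.Size([2, 64, 32])
```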

Authors:Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu, Nier Wu, Xufei Zhuang, Haiteng Xu, Gan-qi-qi-ge Cha
Title: iCD: An Implicit Clustering Distillation Method for Structural Information Mining
Abstract:
Logit Knowledge Distillation has gained substantial research interest in recent years due to its simplicity and lack of requirement for intermediate feature alignment; however, it suffers from limited interpretability in its decision-making process. To address this, we propose implicit Clustering Distillation (iCD): a simple and effective method that mines and transfers interpretable structural knowledge from logits, without requiring ground-truth labels or feature-space alignment. iCD leverages Gram matrices over decoupled local logit representations to enable student models to learn latent semantic structural patterns. Extensive experiments on benchmark datasets demonstrate the effectiveness of iCD across diverse teacher-student architectures, with particularly strong performance in fine-grained classification tasks -- achieving a peak improvement of +5.08% over the baseline. The code is available at: https://github.com/maomaochongaa/iCD.
Summary: This paper introduces implicit Clustering Distillation (iCD), a novel method that enhances interpretability in knowledge distillation by transferring structural patterns from logits without requiring labels or feature alignment, achieving up to 5.08% improvement in fine-grained classification tasks.
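The Gram-matrix transfer admits a rough sketch: split the logits into local groups, build a sample-to-sample Gram matrix per group for teacher and student, and penalize their mismatch; no labels are involved. The uniform chunking below is a guess for illustration, as the paper's exact decoupling may differ.
```python
# Sketch of Gram-matrix structural distillation over decoupled logit groups.
import torch
import torch.nn.functional as F

def gram_structure_loss(student_logits, teacher_logits, n_groups=4):
    """logits: (B, C). Compare per-group sample-similarity Gram matrices."""
    loss = 0.0
    for s, t in zip(student_logits.chunk(n_groups, dim=1),
                    teacher_logits.chunk(n_groups, dim=1)):
        gs = F.normalize(s, dim=1) @ F.normalize(s, dim=1).T  # (B, B)
        gt = F.normalize(t, dim=1) @ F.normalize(t, dim=1).T
        loss = loss + F.mse_loss(gs, gt.detach())
    return loss / n_groups

loss = gram_structure_loss(torch.randn(8, 100), torch.randn(8, 100))
```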

Authors:Fazle Rafsani, Jay Shah, Catherine D. Chong, Todd J. Schwedt, Teresa Wu
Title: DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification
Abstract:
Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
Summary: This study introduces an attention-based framework using the self-supervised DINOv2 model to classify anomalies in 3D medical images, effectively addressing data scarcity and class imbalance through adaptive slice weighting and a composite loss function, validated on brain MRI datasets.
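The aggregation step is easy to picture: each axial slice is embedded (e.g., by DINOv2), a small scorer assigns an importance weight per slice, and the volume feature is their softmax-weighted sum. Dimensions and the scorer below are illustrative.
```python
# Minimal sketch of soft-attention aggregation of per-slice embeddings
# into one volume-level feature (hypothetical dimensions).
import torch
import torch.nn as nn

class SliceAttentionPool(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(),
                                   nn.Linear(128, 1))

    def forward(self, slice_feats):            # (B, S, D): S axial slices
        w = torch.softmax(self.score(slice_feats), dim=1)  # (B, S, 1) weights
        return (w * slice_feats).sum(dim=1)    # (B, D) volume embedding

pool = SliceAttentionPool()
vol = pool(torch.randn(2, 64, 768))            # e.g., 64 slices per MRI
print(vol.shape)                               # torch.Size([2, 768])
```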

Authors:Rui-Feng Wang, Mingrui Xu, Matthew C Bauer, Iago Beffart Schardong, Xiaowen Ma, Kangning Cui
Title: Cott-ADNet: Lightweight Real-Time Cotton Boll and Flower Detection Under Field Conditions
Abstract:
Cotton is one of the most important natural fiber crops worldwide, yet harvesting remains limited by labor-intensive manual picking, low efficiency, and yield losses from missing the optimal harvest window. Accurate recognition of cotton bolls and their maturity is therefore essential for automation, yield estimation, and breeding research. We propose Cott-ADNet, a lightweight real-time detector tailored to cotton boll and flower recognition under complex field conditions. Building on YOLOv11n, Cott-ADNet enhances spatial representation and robustness through improved convolutional designs, while introducing two new modules: a NeLU-enhanced Global Attention Mechanism to better capture weak and low-contrast features, and a Dilated Receptive Field SPPF to expand receptive fields for more effective multi-scale context modeling at low computational cost. We curate a labeled dataset of 4,966 images and release an external validation set of 1,216 field images to support future research. Experiments show that Cott-ADNet achieves 91.5% Precision, 89.8% Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs, maintaining stable performance under multi-scale and rotational variations. These results demonstrate Cott-ADNet as an accurate and efficient solution for in-field deployment, and thus provide a reliable basis for automated cotton harvesting and high-throughput phenotypic analysis. Code and dataset are available at https://github.com/SweefongWong/Cott-ADNet.
Summary: The study introduces Cott-ADNet, a lightweight real-time detector that enhances cotton boll and flower recognition in complex field conditions through improved modules, achieving high accuracy with low computational cost for automated harvesting and research.

Authors:Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He
Title: OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Abstract:
The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.

Authors:Salma Galaaoui, Eduardo Valle, David Picard, Nermin Samet
Title: 3D Human Pose and Shape Estimation from LiDAR Point Clouds: A Review
Abstract:
In this paper, we present a comprehensive review of 3D human pose estimation and human mesh recovery from in-the-wild LiDAR point clouds. We compare existing approaches across several key dimensions, and propose a structured taxonomy to classify these methods. Following this taxonomy, we analyze each method's strengths, limitations, and design choices. In addition, (i) we perform a quantitative comparison of the three most widely used datasets, detailing their characteristics; (ii) we compile unified definitions of all evaluation metrics; and (iii) we establish benchmark tables for both tasks on these datasets to enable fair comparisons and promote progress in the field. We also outline open challenges and research directions critical for advancing LiDAR-based 3D human understanding. Moreover, we maintain an accompanying webpage that organizes papers according to our taxonomy and continuously update it with new studies: https://github.com/valeoai/3D-Human-Pose-Shape-Estimation-from-LiDAR
Summary: This paper provides a comprehensive review and taxonomy of 3D human pose estimation and mesh recovery from LiDAR data, comparing methods, datasets, and metrics while establishing benchmarks and identifying future research directions.

Authors:Felix B. Mueller, Timo Lueddecke, Richard Vogg, Alexander S. Ecker
Title: Domain-Adaptive Pretraining Improves Primate Behavior Recognition
Abstract:
Computer vision for animal behavior offers promising tools to aid research in ecology and cognition and to support conservation efforts. Video camera traps allow for large-scale data collection, but high labeling costs remain a bottleneck to creating large-scale datasets. We thus need data-efficient learning approaches. In this work, we show that we can utilize self-supervised learning to considerably improve action recognition on primate behavior. On two datasets of great ape behavior (PanAf and ChimpACT), we outperform published state-of-the-art action recognition models by 6.1 percentage points in accuracy and 6.3 percentage points in mAP, respectively. We achieve this by utilizing a pretrained V-JEPA model and applying domain-adaptive pretraining (DAP), i.e., continuing the pretraining with in-domain data. We show that most of the performance gain stems from the DAP. Our method promises great potential for improving the recognition of animal behavior, as DAP does not require labeled samples. Code is available at https://github.com/ecker-lab/dap-behavior
Summary: Self-supervised learning with domain-adaptive pretraining significantly enhances action recognition in primate behavior, outperforming existing models without requiring labeled data.

Authors:Johanna Karras, Yingwei Li, Yasamin Jafarian, Ira Kemelmacher-Shlizerman
Title: HoloGarment: 360° Novel View Synthesis of In-the-Wild Garments
Abstract:
Novel view synthesis (NVS) of in-the-wild garments is a challenging task due to significant occlusions, complex human poses, and cloth deformations. Prior methods rely on synthetic 3D training data consisting of mostly unoccluded and static objects, leading to poor generalization on real-world clothing. In this paper, we propose HoloGarment (Hologram-Garment), a method that takes 1-3 images or a continuous video of a person wearing a garment and generates 360° novel views of the garment in a canonical pose. Our key insight is to bridge the domain gap between real and synthetic data with a novel implicit training paradigm leveraging a combination of large-scale real video data and small-scale synthetic 3D data to optimize a shared garment embedding space. During inference, the shared embedding space further enables dynamic video-to-360° NVS through the construction of a garment "atlas" representation by finetuning a garment embedding on a specific real-world video. The atlas captures garment-specific geometry and texture across all viewpoints, independent of body pose or motion. Extensive experiments show that HoloGarment achieves state-of-the-art performance on NVS of in-the-wild garments from images and videos. Notably, our method robustly handles challenging real-world artifacts, such as wrinkling, pose variation, and occlusion, while maintaining photorealism, view consistency, fine texture details, and accurate geometry. Visit our project page for additional results: https://johannakarras.github.io/HoloGarment

Authors:Ondřej Valach, Ivan Gruber
Title: RailSafeNet: Visual Scene Understanding for Tram Safety
Abstract:
Tram-human interaction safety is an important challenge, given that trams frequently operate in densely populated areas, where collisions can range from minor injuries to fatal outcomes. This paper addresses the issue from the perspective of designing a solution leveraging digital image processing, deep learning, and artificial intelligence to improve the safety of pedestrians, drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a real-time framework that fuses semantic segmentation, object detection and a rule-based Distance Assessor to highlight track intrusions. Using only monocular video, the system identifies rails, localises nearby objects and classifies their risk by comparing projected distances with the standard 1435mm rail gauge. Experiments on the diverse RailSem19 dataset show that a class-filtered SegFormer B3 model achieves 65% intersection-over-union (IoU), while a fine-tuned YOLOv8 attains 75.6% mean average precision (mAP) at an IoU threshold of 0.50. RailSafeNet therefore delivers accurate, annotation-light scene understanding that can warn drivers before dangerous situations escalate. Code available at https://github.com/oValach/RailSafeNet.
Summary: This paper introduces RailSafeNet, a real-time framework using digital image processing and AI to detect track intrusions and assess collision risks for trams, achieving high accuracy in object detection and semantic segmentation on the RailSem19 dataset.
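The Distance Assessor's core trick is to use the fixed 1435 mm rail gauge as an in-image metric ruler. A minimal sketch of that conversion, with invented thresholds and variable names, might look like this:
```python
# Sketch: convert a pixel offset from the track into metres using the
# known 1435 mm standard gauge as reference (illustrative thresholds).
def assess_risk(object_px_from_track, rail_gauge_px, warn_m=1.5, danger_m=0.5):
    m_per_px = 1.435 / rail_gauge_px        # standard gauge = 1435 mm
    dist_m = object_px_from_track * m_per_px
    if dist_m < danger_m:
        return dist_m, "danger"
    return dist_m, "warning" if dist_m < warn_m else "safe"

print(assess_risk(object_px_from_track=120, rail_gauge_px=300))
# (0.574, 'warning'): ~0.57 m from the track at this image scale
```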

Authors:Bernardo Forni, Gabriele Lombardi, Federico Pozzi, Mirco Planamente
Title: FS-SAM2: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation via Low-Rank Adaptation
Abstract:
Few-shot semantic segmentation has recently attracted great attention. The goal is to develop a model capable of segmenting unseen classes using only a few annotated samples. Most existing approaches adapt a pre-trained model by training an additional module from scratch. Achieving optimal performance with these approaches requires extensive training on large-scale datasets. The Segment Anything Model 2 (SAM2) is a foundational model for zero-shot image and video segmentation with a modular design. In this paper, we propose a Few-Shot segmentation method based on SAM2 (FS-SAM2), where SAM2's video capabilities are directly repurposed for the few-shot task. Moreover, we apply Low-Rank Adaptation (LoRA) to the original modules in order to handle the diverse images typically found in standard datasets, unlike the temporally connected frames used in SAM2's pre-training. With this approach, only a small number of parameters are meta-trained, which effectively adapts SAM2 while benefiting from its impressive segmentation performance. Our method supports any K-shot configuration. We evaluate FS-SAM2 on the PASCAL-5^i, COCO-20^i and FSS-1000 datasets, achieving remarkable results and demonstrating excellent computational efficiency during inference. Code is available at https://github.com/fornib/FS-SAM2
Summary: This paper introduces FS-SAM2, a few-shot segmentation method that repurposes the Segment Anything Model 2's video capabilities and applies Low-Rank Adaptation to efficiently adapt it with minimal parameter training, achieving remarkable results on multiple benchmarks while maintaining computational efficiency.
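The LoRA component is standard and easy to sketch: freeze a pre-trained linear layer and learn only a low-rank update beside it, so meta-training touches a tiny fraction of the parameters. This is a generic wrapper, not SAM2's actual modules:
```python
# Generic LoRA wrapper: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank          # B starts at zero, so the wrapped
                                           # layer initially equals the base
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256))
y = layer(torch.randn(2, 256))             # same shape as the base layer
```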

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Title: U-Mamba2: Scaling State Space Models for Dental Anatomy Segmentation in CBCT
Abstract:
Cone-Beam Computed Tomography (CBCT) is a widely used 3D imaging technique in dentistry, providing volumetric information about the anatomical structures of jaws and teeth. Accurate segmentation of these anatomies is critical for clinical applications such as diagnosis and surgical planning, but remains time-consuming and challenging. In this paper, we present U-Mamba2, a new neural network architecture designed for multi-anatomy CBCT segmentation in the context of the ToothFairy3 challenge. U-Mamba2 integrates the Mamba2 state space models into the U-Net architecture, enforcing stronger structural constraints for higher efficiency without compromising performance. In addition, we integrate interactive click prompts with cross-attention blocks, pre-train U-Mamba2 using self-supervised learning, and incorporate dental domain knowledge into the model design to address key challenges of dental anatomy segmentation in CBCT. Extensive experiments, including independent tests, demonstrate that U-Mamba2 is both effective and efficient, securing first place in both tasks of the ToothFairy3 challenge. In Task 1, U-Mamba2 achieved a mean Dice of 0.84 and an HD95 of 38.17 on the held-out test data, with an average inference time of 40.58s. In Task 2, U-Mamba2 achieved a mean Dice of 0.87 and an HD95 of 2.15 on the held-out test data. The code is publicly available at https://github.com/zhiqin1998/UMamba2.
Summary: U-Mamba2 is a novel neural network that integrates Mamba2 state space models into the U-Net architecture, achieving efficient and accurate multi-anatomy CBCT segmentation for dental applications and winning first place in both tasks of the ToothFairy3 challenge.

Authors:Farahdiba Zarin, Nicolas Padoy, Jérémy Dana, Vinkle Srivastav
Title: End-to-End Learning of Multi-Organ Implicit Surfaces from 3D Medical Imaging Data
Abstract:
The fine-grained surface reconstruction of different organs from 3D medical imaging can provide advanced diagnostic support and improved surgical planning. However, the representation of the organs is often limited by the resolution, with higher resolutions requiring a larger memory and compute footprint. Implicit representations of objects have been proposed to alleviate this problem in general computer vision by providing compact and differentiable functions to represent 3D object shapes. However, architectural and data-related differences prevent the direct application of these methods to medical images. This work introduces ImplMORe, an end-to-end deep learning method using implicit surface representations for multi-organ reconstruction from 3D medical images. ImplMORe incorporates local features using a 3D CNN encoder and performs multi-scale interpolation to learn the features in the continuous domain using occupancy functions. We apply our method for single and multiple organ reconstructions using the TotalSegmentator dataset. By leveraging the continuous nature of occupancy functions, our approach outperforms discrete explicit representation based surface reconstruction approaches, providing fine-grained surface details of the organ at a resolution higher than the given input image. The source code will be made publicly available at: https://github.com/CAMMA-public/ImplMORe
Summary: ImplMORe is an end-to-end deep learning method that uses implicit surface representations to achieve fine-grained multi-organ reconstruction from 3D medical images, outperforming discrete approaches by providing higher-resolution surface details with continuous occupancy functions.
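The occupancy-function idea fits in a few lines: trilinearly sample local CNN features at continuous 3D coordinates and decode per-organ occupancy with an MLP, so the surface can be queried at any resolution. Shapes and the decoder head below are hypothetical, not the released code:
```python
# Sketch: query occupancy at continuous coordinates via trilinear
# interpolation of a 3D feature volume (hypothetical shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyHead(nn.Module):
    def __init__(self, feat_dim=32, n_organs=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 3, 128), nn.ReLU(),
                                 nn.Linear(128, n_organs))

    def forward(self, feat_vol, coords):
        # feat_vol: (B, C, D, H, W); coords: (B, P, 3) in [-1, 1]
        grid = coords.view(coords.shape[0], -1, 1, 1, 3)
        f = F.grid_sample(feat_vol, grid, align_corners=True)  # (B, C, P, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)          # (B, P, C)
        return torch.sigmoid(self.mlp(torch.cat([f, coords], dim=-1)))

head = OccupancyHead()
occ = head(torch.randn(1, 32, 16, 16, 16), torch.rand(1, 1000, 3) * 2 - 1)
print(occ.shape)  # torch.Size([1, 1000, 4]): any number of query points
```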

Authors:Sebastian Diaz, Benjamin Billot, Neel Dey, Molin Zhang, Esra Abaci Turk, P. Ellen Grant, Polina Golland, Elfar Adalsteinsson
Title: Robust Fetal Pose Estimation across Gestational Ages via Cross-Population Augmentation
Abstract:
Fetal motion is a critical indicator of neurological development and intrauterine health, yet its quantification remains challenging, particularly at earlier gestational ages (GA). Current methods track fetal motion by predicting the location of annotated landmarks on 3D echo planar imaging (EPI) time-series, primarily in third-trimester fetuses. The predicted landmarks enable simplification of the fetal body for downstream analysis. While these methods perform well within their training age distribution, they consistently fail to generalize to early GAs due to significant anatomical changes in both mother and fetus across gestation, as well as the difficulty of obtaining annotated early GA EPI data. In this work, we develop a cross-population data augmentation framework that enables pose estimation models to robustly generalize to younger GA clinical cohorts using only annotated images from older GA cohorts. Specifically, we introduce a fetal-specific augmentation strategy that simulates the distinct intrauterine environment and fetal positioning of early GAs. Our experiments find that cross-population augmentation yields reduced variability and significant improvements across both older GA and challenging early GA cases. By enabling more reliable pose estimation across gestation, our work potentially facilitates early clinical detection and intervention in challenging 4D fetal imaging settings. Code is available at https://github.com/sebodiaz/cross-population-pose.
Summary: This study introduces a cross-population data augmentation framework that enables fetal pose estimation models to generalize effectively to earlier gestational ages using only annotated data from older cohorts, improving motion tracking reliability across different developmental stages.

Authors:Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Title: Exploring Efficient Open-Vocabulary Segmentation in Remote Sensing
Abstract:
Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at https://github.com/LiBingyu01/RSKT-Seg.
Summary: To address the lack of benchmarks and domain gaps in Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), this study introduces a standardized evaluation benchmark and proposes RSKT-Seg, a novel framework that integrates multi-directional feature aggregation and domain adaptation, achieving superior performance and efficiency over existing methods.

Authors:Zilong Zhang, Chujie Qin, Chunle Guo, Yong Zhang, Chao Xue, Ming-Ming Cheng, Chongyi Li
Title: RAM++: Robust Representation Learning via Adaptive Mask for All-in-One Image Restoration
Abstract:
This work presents Robust Representation Learning via Adaptive Mask (RAM++), a two-stage framework for all-in-one image restoration. RAM++ integrates high-level semantic understanding with low-level texture generation to achieve content-oriented robust restoration. It addresses the limitations of existing degradation-oriented methods in extreme scenarios (e.g., degradations strongly coupled with image structures). RAM++ also mitigates common challenges such as unbalanced performance across tasks, overfitting to seen degradations, and weak generalization to unseen ones through three key designs: 1) Adaptive Semantic-Aware Mask (AdaSAM): a pretraining strategy that applies pixel-level masks to semantically rich and textured regions. This design enables the network to learn both generative priors and image content priors from various degradations. 2) Mask Attribute Conductance (MAC): a selective fine-tuning strategy that adjusts the layers with higher contributions to bridge the integrity gap between masked pretraining and full-image fine-tuning while retaining learned priors. 3) Robust Feature Regularization (RFR): a strategy that leverages DINOv2's semantically consistent and degradation-invariant representations, together with efficient feature fusion, to achieve faithful and semantically coherent restoration. With these designs, RAM++ achieves robust, well-balanced, and state-of-the-art performance across seen, unseen, extreme, and mixed degradations. Our code and model will be released at https://github.com/DragonisCV/RAM
Summary: RAM++ is a two-stage image restoration framework that integrates semantic understanding with texture generation through adaptive masking strategies to achieve robust performance across diverse degradation scenarios.
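Reduced to its essence, the AdaSAM strategy masks where the image is informative rather than uniformly at random. The sketch below stands in a simple local-variance texture proxy for the scoring function; RAM++'s actual criterion is semantic-aware, so treat this only as the shape of the idea:
```python
# Sketch: score patches by a texture proxy and mask the top fraction.
import torch

def adaptive_mask(img, patch=16, mask_ratio=0.5):
    B, C, H, W = img.shape
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.contiguous().view(B, C, -1, patch * patch)
    score = patches.var(dim=-1).mean(dim=1)     # (B, N) texture proxy
    n_mask = int(score.shape[1] * mask_ratio)
    idx = score.topk(n_mask, dim=1).indices     # most textured patches
    mask = torch.zeros_like(score)
    mask.scatter_(1, idx, 1.0)
    return mask.bool()                          # True = patch gets masked

mask = adaptive_mask(torch.randn(2, 3, 224, 224))
print(mask.shape, mask.sum(dim=1))  # (2, 196) with 98 masked per image
```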

Authors:Marian Renz, Felix Igelbrink, Martin Atzmueller
Title: Integrating Prior Observations for Incremental 3D Scene Graph Prediction
Abstract:
3D semantic scene graphs (3DSSG) provide compact structured representations of environments by explicitly modeling objects, attributes, and relationships. While 3DSSGs have shown promise in robotics and embodied AI, many existing methods rely mainly on sensor data, not integrating further information from semantically rich environments. Additionally, most methods assume access to complete scene reconstructions, limiting their applicability in real-world, incremental settings. This paper introduces a novel heterogeneous graph model for incremental 3DSSG prediction that integrates additional, multi-modal information, such as prior observations, directly into the message-passing process. Utilizing multiple layers, the model flexibly incorporates global and local scene representations without requiring specialized modules or full scene reconstructions. We evaluate our approach on the 3DSSG dataset, showing that GNNs enriched with multi-modal information such as semantic embeddings (e.g., CLIP) and prior observations offer a scalable and generalizable solution for complex, real-world environments. The full source code of the presented architecture will be made available at https://github.com/m4renz/incremental-scene-graph-prediction.
Summary: This paper presents a novel heterogeneous graph model for incremental 3D semantic scene graph prediction that integrates multi-modal information like prior observations and semantic embeddings, offering a scalable solution without requiring complete scene reconstructions.

Authors:Zhenni Yu, Li Zhao, Guobao Xiao, Xiaoqin Zhang
Title: SAM-TTT: Segment Anything Model via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection
Abstract:
This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM's semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a train-free manner by configuring SAM's parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating the two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM's semantic understanding in the COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.
Summary: This paper presents SAM-TTT, a novel Segment Anything Model that enhances Camouflaged Object Detection by suppressing adverse parameters through reverse configuration while strengthening advantageous ones via test-time training layers, achieving state-of-the-art performance on multiple benchmarks.

Authors:Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, Jiebo Luo, William Yang Wang, Hao Fei, Mong-Li Lee, Wynne Hsu
Title: Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
Abstract:
Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with the input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset, Dr.V-Bench, and a satellite video agent, Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.
Summary: The Dr.V framework effectively diagnoses hallucinations in large video models through a hierarchical approach using fine-grained spatial-temporal grounding, supported by a comprehensive benchmark dataset and a satellite agent that enhances interpretability and reliability.

Authors:Jiacheng Liu, Pengxiang Ding, Qihang Zhou, Yuxuan Wu, Da Huang, Zimian Peng, Wei Xiao, Weinan Zhang, Lixin Yang, Cewu Lu, Donglin Wang
Title: TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
Abstract:
Recent Vision-Language-Action models show potential to generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities. For more details, please refer to our project page: https://jiachengliu3.github.io/TrajBooster/.
Summary: TrajBooster is a cross-embodiment framework that enhances bipedal robot performance by repurposing wheeled-humanoid trajectory data, achieving robust whole-body motions with minimal target-domain demonstrations.

Authors:Liying Wang, Xiaoli Zhang, Chuanmin Jia, Siwei Ma
Title: MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation
Abstract:
Infrared-visible image fusion methods aim at generating fused images with good visual quality while also facilitating the performance of high-level tasks. Indeed, existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose MAFS, a unified network for image fusion and semantic segmentation. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, we devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights for the two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.
Summary: This paper introduces MAFS, a unified network that integrates infrared-visible image fusion and semantic segmentation through parallel sub-networks and a multi-task framework to mutually enhance both tasks.
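The max-min-fairness weighting can be illustrated with a toy rule: normalize each task's loss by a reference scale and give the currently worse-off task the larger weight. The exact formula in MAFS may differ; this only shows the principle:
```python
# Toy max-min-fairness style task weighting (illustrative, not MAFS's formula).
import torch

def fair_task_weights(loss_fusion, loss_seg, ref_fusion, ref_seg, eps=1e-8):
    """ref_*: reference losses (e.g., running averages) that normalize scales."""
    r = torch.stack([loss_fusion / (ref_fusion + eps),
                     loss_seg / (ref_seg + eps)])  # relative task progress
    w = (r / r.sum()).detach()                     # worse-off task gets more
    return w[0], w[1]

w_f, w_s = fair_task_weights(torch.tensor(0.8), torch.tensor(0.2),
                             torch.tensor(1.0), torch.tensor(1.0))
total = w_f * 0.8 + w_s * 0.2                      # weighted joint objective
```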

Authors:Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
Title: SpecVLM: Fast Speculative Decoding in Vision-Language Models
Abstract:
Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5-2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5-2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.
Summary: SpecVLM accelerates vision-language models with speculative decoding, employing an elastic visual compressor and online-logit distillation to achieve 2.5-2.9x speedups while maintaining lossless decoding.
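The distillation objective itself is compact: cross-entropy of the draft model's logits against the teacher's on-the-fly distribution plus a Smooth L1 term on penultimate features. Shapes and weighting below are illustrative:
```python
# Sketch of an online logit-distillation loss: CE on teacher logits
# plus Smooth L1 on penultimate features (illustrative weighting).
import torch
import torch.nn.functional as F

def online_distill_loss(draft_logits, teacher_logits,
                        draft_feat, teacher_feat, beta=1.0):
    t = teacher_logits.detach()                 # on-the-fly teacher targets
    ce = -(F.softmax(t, -1) * F.log_softmax(draft_logits, -1)).sum(-1).mean()
    feat = F.smooth_l1_loss(draft_feat, teacher_feat.detach())
    return ce + beta * feat

loss = online_distill_loss(torch.randn(4, 32000), torch.randn(4, 32000),
                           torch.randn(4, 4096), torch.randn(4, 4096))
```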

Authors:Mehwish Mehmood, Shahzaib Iqbal, Tariq Mahmood Khan, Ivor Spence, Muhammad Fahim
Title: LFRA-Net: A Lightweight Focal and Region-Aware Attention Network for Retinal Vessel Segmentation
Abstract:
Retinal vessel segmentation is critical for the early diagnosis of vision-threatening and systemic diseases, especially in real-world clinical settings with limited computational resources. Although significant improvements have been made in deep learning-based segmentation methods, current models still face challenges in extracting tiny vessels and suffer from high computational costs. In this study, we present LFRA-Net by incorporating focal modulation attention at the encoder-decoder bottleneck and region-aware attention in the selective skip connections. LFRA-Net is a lightweight network optimized for precise and effective retinal vascular segmentation. It enhances feature representation and regional focus by efficiently capturing local and global dependencies. LFRA-Net outperformed many state-of-the-art models while maintaining lightweight characteristics with only 0.17 million parameters, 0.66 MB memory size, and 10.50 GFLOPs. We validated it on three publicly available datasets: DRIVE, STARE, and CHASE_DB. It performed better in terms of Dice score (84.28%, 88.44%, and 85.50%) and Jaccard index (72.86%, 79.31%, and 74.70%) on the DRIVE, STARE, and CHASE_DB datasets, respectively. LFRA-Net offers a favorable trade-off between segmentation accuracy and computational cost compared to existing deep learning methods, which makes it suitable for real-time clinical applications in areas with limited resources. The code can be found at https://github.com/Mehwish4593/LFRA-Net.
Summary: LFRA-Net is a lightweight deep learning model that enhances retinal vessel segmentation by integrating focal modulation and region-aware attention, achieving high accuracy with minimal computational resources for real-world clinical use.

Authors:Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes
Title: Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation for Zero-shot Generalization
Abstract:
Autonomous systems require robust Multi-Object Tracking (MOT) capabilities to operate reliably in dynamic environments. MOT ensures consistent object identity assignment and precise spatial delineation. Recent advances in foundation models, such as SAM2, have demonstrated strong zero-shot generalization for video segmentation, but their direct application to MOTS (MOT+Segmentation) remains limited by insufficient identity management and memory efficiency. This work introduces Seg2Track-SAM2, a framework that integrates pre-trained object detectors with SAM2 and a novel Seg2Track module to address track initialization, track management, and reinforcement. The proposed approach requires no fine-tuning and remains detector-agnostic. Experimental results on KITTI MOT and KITTI MOTS benchmarks show that Seg2Track-SAM2 achieves state-of-the-art (SOTA) performance, ranking fourth overall in both car and pedestrian classes on KITTI MOTS, while establishing a new benchmark in association accuracy (AssA). Furthermore, a sliding-window memory strategy reduces memory usage by up to 75% with negligible performance degradation, supporting deployment under resource constraints. These results confirm that Seg2Track-SAM2 advances MOTS by combining robust zero-shot tracking, enhanced identity preservation, and efficient memory utilization. The code is available at https://github.com/hcmr-lab/Seg2Track-SAM2
Summary: Seg2Track-SAM2 introduces a novel framework combining object detectors with SAM2 and a Seg2Track module to achieve state-of-the-art MOTS performance without fine-tuning, featuring enhanced identity management and up to 75% memory reduction through a sliding-window strategy.
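The sliding-window memory strategy is simple to sketch: keep only the K most recent frame memories per object, optionally pinning the first frame as an identity anchor, so memory stays bounded on arbitrarily long videos. This illustrates the idea, not the released code:
```python
# Sketch of a bounded per-object memory: first-frame anchor + recent window.
from collections import deque

class SlidingMemory:
    def __init__(self, window=8, keep_first=True):
        self.keep_first = keep_first
        self.first = None
        self.recent = deque(maxlen=window)   # old entries drop automatically

    def add(self, frame_mem):
        if self.first is None and self.keep_first:
            self.first = frame_mem           # anchor for identity stability
        else:
            self.recent.append(frame_mem)

    def get(self):
        anchor = [self.first] if self.first is not None else []
        return anchor + list(self.recent)

mem = SlidingMemory(window=3)
for t in range(10):
    mem.add({"t": t})
print([m["t"] for m in mem.get()])           # [0, 7, 8, 9]
```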

Authors:Wa-Kin Lei, Jun-Cheng Chen, Shang-Tse Chen
Title: DRAG: Data Reconstruction Attack using Guided Diffusion
Abstract:
With the rise of large foundation models, split inference (SI) has emerged as a popular computational paradigm for deploying models across lightweight edge devices and cloud servers, addressing data privacy and computational cost concerns. However, most existing data reconstruction attacks have focused on smaller CNN classification models, leaving the privacy risks of foundation models in SI settings largely unexplored. To address this gap, we propose a novel data reconstruction attack based on guided diffusion, which leverages the rich prior knowledge embedded in a latent diffusion model (LDM) pre-trained on a large-scale dataset. Our method performs iterative reconstruction on the LDM's learned image prior, effectively generating high-fidelity images resembling the original data from their intermediate representations (IR). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, both qualitatively and quantitatively, in reconstructing data from deep-layer IRs of the vision foundation model. The results highlight the urgent need for more robust privacy protection mechanisms for large models in SI scenarios. Code is available at: https://github.com/ntuaislab/DRAG.
Summary: This study introduces a guided diffusion-based data reconstruction attack that effectively recovers high-fidelity images from intermediate representations of vision foundation models in split inference scenarios, revealing significant privacy vulnerabilities.

Authors:Chuang Liu, Nan Guo
Title: Joint-OCTAMamba: An OCTA Joint Segmentation Network Based on Feature-Enhanced Mamba
Abstract:
OCTA is a crucial non-invasive imaging technique for diagnosing and monitoring retinal diseases like diabetic retinopathy, age-related macular degeneration, and glaucoma. Current 2D-based methods for retinal vessel (RV) segmentation offer insufficient accuracy. To address this, we propose RVMamba, a novel architecture integrating multiple feature extraction modules with the Mamba state-space model. Moreover, existing joint segmentation models for OCTA data exhibit a performance imbalance between different tasks. To simultaneously improve the segmentation of the foveal avascular zone (FAZ) and mitigate this imbalance, we introduce FAZMamba and a unified Joint-OCTAMamba framework. Experimental results on the OCTA-500 dataset demonstrate that Joint-OCTAMamba outperforms existing models across evaluation metrics. The code is available at https://github.com/lc-sfis/Joint-OCTAMamba.
Summary: The proposed RVMamba and Joint-OCTAMamba frameworks significantly enhance retinal vessel and foveal avascular zone segmentation in OCTA imaging, outperforming existing models on the OCTA-500 dataset.

Authors:Qiyuan Guan, Qianfeng Yang, Xiang Chen, Tianyu Song, Guiyue Jin, Jiyu Jin
Title: WeatherBench: A Real-World Benchmark Dataset for All-in-One Adverse Weather Image Restoration
Abstract:
Existing all-in-one image restoration approaches, which aim to handle multiple weather degradations within a single framework, are predominantly trained and evaluated using mixed single-weather synthetic datasets. However, these datasets often differ significantly in resolution, style, and domain characteristics, leading to substantial domain gaps that hinder the development and fair evaluation of unified models. Furthermore, the lack of a large-scale, real-world all-in-one weather restoration dataset remains a critical bottleneck in advancing this field. To address these limitations, we present a real-world all-in-one adverse weather image restoration benchmark dataset, which contains image pairs captured under various weather conditions, including rain, snow, and haze, as well as diverse outdoor scenes and illumination settings. The resulting dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of task-specific, task-general, and all-in-one restoration methods on our dataset. Our dataset offers a valuable foundation for advancing robust and practical all-in-one image restoration in real-world scenarios. The dataset has been publicly released and is available at https://github.com/guanqiyuan/WeatherBench.
Summary: This paper introduces a real-world benchmark dataset for all-in-one adverse weather image restoration, addressing domain gaps in existing synthetic datasets and enabling supervised learning and rigorous evaluation of unified models.

Authors:Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang
Title: SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
Abstract:
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel 'Forecast-then-verify' acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and a 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our code has been released on GitHub: https://github.com/Shenyi-Z/Cache4Diffusion
Summary: SpeCa introduces a speculative sampling framework that accelerates diffusion models by predicting features for future timesteps and verifying their reliability with minimal overhead, achieving up to 7.3x speedup while maintaining generation quality.
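A forecast-then-verify loop can be mocked up in a few lines, assuming a toy model API (`step` runs the full pass and returns features; `step_from_feats` reuses forecast features). The acceptance test here, feature drift between steps, is a crude stand-in for SpeCa's parameter-free reliability verifier:
```python
# Toy forecast-then-verify denoising loop (hypothetical model API).
import torch

class TinyModel:
    """Stand-in for a diffusion transformer."""
    def step(self, x, t):
        feats = torch.tanh(x + 0.01 * t)     # full (expensive) forward pass
        return x - 0.1 * feats, feats
    def step_from_feats(self, x, t, feats):
        return x - 0.1 * feats               # cheap step reusing forecast feats

def denoise(model, x, steps=20, tol=0.05):
    prev = cur = None
    for t in range(steps):
        if prev is not None:
            drift = (cur - prev).norm() / (cur.norm() + 1e-8)
            if drift < tol:                  # verify: features evolving slowly
                forecast = cur + (cur - prev)   # linear feature extrapolation
                x = model.step_from_feats(x, t, forecast)
                prev, cur = cur, forecast
                continue
        x, feats = model.step(x, t)          # reject: fall back to full pass
        prev, cur = cur, feats
    return x

out = denoise(TinyModel(), torch.randn(4, 8))
```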

Authors:Yanyun Pu, Kehan Li, Zeyi Huang, Zhijie Zhong, Kaixiang Yang
Title: MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment
Abstract:
With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meanwhile, incorporating an explicit reasoning process during VQA training substantially boosts zero-shot generalization. Code and dataset will be available at https://github.com/Controller01-ai/MVQA-68K
Summary: To address the limitations of traditional video quality assessment methods, MVQA-68K provides a multi-dimensional dataset with over 68,000 annotated videos and detailed reasoning, significantly improving multimodal models' performance and generalization on VQA tasks.

Authors:Haonan Shi, Yubin Wang, De Cheng, Lingfeng He, Nannan Wang, Xinbo Gao
Title: Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification
Abstract:
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets by reducing the modality gap while minimizing reliance on costly manual annotations. Existing methods typically address USVI-ReID using cluster-based contrastive learning, which represents a person by a single cluster center. However, they primarily focus on the commonality of images within each cluster while neglecting the finer-grained differences among them. To address the limitation, we propose a Hierarchical Identity Learning (HIL) framework. Since each cluster may contain several smaller sub-clusters that reflect fine-grained variations among images, we generate multiple memories for each existing coarse-grained cluster via a secondary clustering. Additionally, we propose Multi-Center Contrastive Learning (MCCL) to refine representations for enhancing intra-modal clustering and minimizing cross-modal discrepancies. To further improve cross-modal matching quality, we design a Bidirectional Reverse Selection Transmission (BRST) mechanism, which establishes reliable cross-modal correspondences by performing bidirectional matching of pseudo-labels. Extensive experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms existing approaches. The source code is available at: https://github.com/haonanshi0125/HIL.
中文摘要:该研究提出的分层身份学习框架通过多中心对比学习和双向匹配机制,解决了无监督可见光-红外行人重识别中细粒度差异被忽视的问题,在减少跨模态差异的同时显著提升了基准数据集上的性能表现。
English Summary: The proposed Hierarchical Identity Learning framework addresses limitations in unsupervised visible-infrared person re-identification by introducing multi-center contrastive learning and bidirectional matching to capture fine-grained variations while reducing cross-modal discrepancies, achieving superior performance on benchmark datasets.
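Note (illustrative sketch, not the released HIL code): the hierarchical memory idea above can be approximated by a secondary k-means pass inside each coarse cluster, producing several sub-cluster memories per identity, with a contrastive score computed against the closest memory of each identity. The function names, sub-cluster count, and simplified loss are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hierarchical_memories(features, coarse_labels, n_sub=3):
    """features: (N, D) array, coarse_labels: (N,) pseudo-identity labels."""
    memories = {}
    for pid in np.unique(coarse_labels):
        feats = features[coarse_labels == pid]
        k = min(n_sub, len(feats))
        centers = KMeans(n_clusters=k, n_init=10).fit(feats).cluster_centers_
        memories[pid] = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return memories

def multi_center_contrastive(feat, pid, memories, temperature=0.05):
    """InfoNCE-style loss using the closest sub-cluster memory per identity."""
    feat = feat / np.linalg.norm(feat)
    sims = {p: (c @ feat).max() for p, c in memories.items()}  # closest center
    keys = sorted(sims)
    logits = np.array([sims[p] for p in keys]) / temperature
    target = keys.index(pid)
    return -(logits[target] - np.log(np.exp(logits).sum()))

features = np.random.randn(200, 128).astype(np.float32)
labels = np.random.randint(0, 10, size=200)
memories = build_hierarchical_memories(features, labels)
loss = multi_center_contrastive(features[0], labels[0], memories)
```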

Authors:Dezhen Wang, Haixiang Zhao, Xiang Shen, Sheng Miao
Title: SFGNet: Semantic and Frequency Guided Network for Camouflaged Object Detection
Abstract:
Camouflaged object detection (COD) aims to segment objects that blend into their surroundings. However, most existing studies overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. In this work, we propose a novel Semantic and Frequency Guided Network (SFGNet), which incorporates semantic prompts and frequency-domain features to capture camouflaged objects and improve boundary perception. We further design a Multi-Band Fourier Module (MBFM) to enhance the network's ability to handle complex backgrounds and blurred boundaries. In addition, we design an Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details in the predictions. Extensive experiments conducted on three COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches. The core code of the model is available at the following link: https://github.com/winter794444/SFGNetICASSP2026.
中文摘要:本研究提出的SFGNet方法通过整合语义提示和频域特征,结合专门设计的模块,在伪装物体检测和边界感知方面显著优于现有方法。
English Summary: The proposed SFGNet method integrates semantic prompts and frequency-domain features through specialized modules to significantly improve camouflaged object detection and boundary perception, demonstrating superior performance over existing approaches.

Authors:Wenhao Tang, Sheng Huang, Heng Fang, Fengtao Zhou, Bo Liu, Qingshan Liu
Title: Multiple Instance Learning Framework with Masked Hard Instance Mining for Gigapixel Histopathology Image Analysis
Abstract:
Digitizing pathological images into gigapixel Whole Slide Images (WSIs) has opened new avenues for Computational Pathology (CPath). As positive tissue comprises only a small fraction of gigapixel WSIs, existing Multiple Instance Learning (MIL) methods typically focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting challenging ones. Recent studies have shown that hard examples are crucial for accurately modeling discriminative boundaries. Applying such an idea at the instance level, we develop a novel MIL framework with masked hard instance mining (MHIM-MIL), which utilizes a Siamese structure with a consistency constraint to explore the hard instances. Using a class-aware instance probability, MHIM-MIL employs a momentum teacher to mask salient instances and implicitly mine hard instances for training the student model. To obtain diverse, non-redundant hard instances, we adopt large-scale random masking while utilizing a global recycle network to mitigate the risk of losing key features. Furthermore, the student updates the teacher using an exponential moving average, which identifies new hard instances for subsequent training iterations and stabilizes optimization. Experimental results on cancer diagnosis, subtyping, and survival analysis tasks across 12 benchmarks demonstrate that MHIM-MIL outperforms the latest methods in both performance and efficiency. The code is available at: https://github.com/DearCaat/MHIM-MIL.
中文: MHIM-MIL框架通过孪生网络结构和掩码硬实例挖掘技术,有效解决了计算病理学中难以分类样本被忽略的问题,在多项基准测试中性能与效率均显著超越现有方法。
English: The MHIM-MIL framework introduces masked hard instance mining through a Siamese structure to address the neglect of challenging instances in computational pathology, significantly outperforming existing methods across multiple benchmarks while improving efficiency.
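Note (hedged sketch, not the authors' released code): the masking-and-EMA recipe above can be pictured as follows — a momentum teacher scores instances, the most salient instances plus a large random subset are masked out, the student trains on the remaining (hard) instances, and the teacher is updated by an exponential moving average. The attention head, masking ratios, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnMIL(nn.Module):
    """Minimal attention-based MIL head: per-instance scores plus a bag logit."""
    def __init__(self, dim=512, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, x):                        # x: (num_instances, dim)
        a = torch.softmax(self.attn(x), dim=0)   # (num_instances, 1)
        bag = (a * x).sum(dim=0)                 # attention-pooled bag feature
        return self.cls(bag), a.squeeze(-1)

def mask_salient_instances(scores, salient_ratio=0.1, random_ratio=0.5):
    """Drop the most salient instances (by teacher scores) plus a large random
    subset, leaving mostly 'hard' instances for the student."""
    n = scores.numel()
    keep = torch.ones(n, dtype=torch.bool)
    keep[scores.topk(max(1, int(n * salient_ratio))).indices] = False
    keep &= ~(torch.rand(n) < random_ratio)
    if keep.sum() == 0:                          # avoid an empty bag
        keep[scores.argmin()] = True
    return keep

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

# One illustrative training step on a random bag.
student, teacher = AttnMIL(), AttnMIL()
teacher.load_state_dict(student.state_dict())
bag, label = torch.randn(200, 512), torch.tensor([1])

with torch.no_grad():
    _, t_scores = teacher(bag)                   # teacher's instance saliency
keep = mask_salient_instances(t_scores)
logits, _ = student(bag[keep])                   # student sees the hard instances
loss = nn.functional.cross_entropy(logits.unsqueeze(0), label)
loss.backward()
ema_update(teacher, student)
```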

Authors:Ayhan Can Erdur, Christian Beischl, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C Peeken
Title: MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder
Abstract:
Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing an absolute improvement of 10.1 in overall Dice score and 0.46 in MCC over the baselines with missing input sequences. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.
中文摘要:本研究提出了一种用于3D脑部MRI分析的掩码自编码器框架,通过将各MRI序列视为独立模态,利用跨序列推理重建缺失数据,在后续任务中实现了显著性能提升。
English Summary: This study introduces a masked autoencoder framework for 3D brain MRI analysis that handles missing sequences by treating each MRI sequence as a separate modality, using cross-sequence reasoning to reconstruct missing data while achieving significant performance gains in downstream tasks.

Authors:Jeanny Pan, Philipp Seeböck, Christoph Fürböck, Svitlana Pochepnia, Jennifer Straub, Lucian Beer, Helmut Prosch, Georg Langs
Title: Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery
Abstract:
Identifying new disease-related patterns in medical imaging data with the help of machine learning enlarges the vocabulary of recognizable findings. This supports diagnostic and prognostic assessment. However, image appearance varies not only due to biological differences, but also due to imaging technology linked to vendors and scanning or reconstruction parameters. The resulting domain shifts impede data representation learning strategies and the discovery of biologically meaningful cluster appearances. To address these challenges, we introduce an approach to actively learn the domain shift via post-hoc rotation of the data latent space, enabling disentanglement of biological and technical factors. Results on real-world heterogeneous clinical data showcase that the learned disentangled representation leads to stable clusters representing tissue-types across different acquisition settings. Cluster consistency is improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the entangled representation, outperforming four state-of-the-art harmonization methods. When using the clusters to quantify tissue composition on idiopathic pulmonary fibrosis patients, the learned profiles enhance Cox survival prediction. This indicates that the proposed label-free framework facilitates biomarker discovery in multi-center routine imaging data. Code is available on GitHub https://github.com/cirmuw/latent-space-rotation-disentanglement.
中文: 通过潜在空间旋转主动学习域偏移,该方法在医学影像中分离生物与技术因素,提升了跨设备聚类稳定性,并在多中心数据中增强了生存预测能力。
English: Machine learning identifies new disease patterns in medical imaging by disentangling biological and technical factors through latent space rotation, improving cluster consistency and enhancing survival prediction in multi-center data.

Authors:Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long
Title: Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding
Abstract:
Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectory of objects by analyzing the changes in their positions in consecutive frames of images, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects move in a non-stationary manner. In this work, we utilize the learning-aided filter to handle the motion estimation of MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) by two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves superior robustness and accuracy compared with existing learning-aided filters. The code is available at (https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker).
中文: 本文提出了一种名为语义独立卡尔曼网络(SIKNet)的学习辅助运动估计方法,通过两步编码状态向量来捕捉独立语义和非线性依赖信息,在多目标跟踪中展现出优于传统卡尔曼滤波器和其他学习型滤波器的鲁棒性与准确性。
English: This paper introduces Semantic-Independent KalmanNet (SIKNet), a learning-aided motion estimation method for multi-object tracking that enhances robustness and accuracy by encoding state vectors with independent semantic and nonlinear dependency information, outperforming traditional Kalman filters and other learning-based approaches.
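Note (hedged sketch in the spirit of the SIE described above, not the released SIKNet implementation): assuming the input stacks several state vectors that share the same semantic layout (e.g. x, y, w, h, vx, vy), shaped (batch, num_vectors, state_dim), a Conv1d with kernel size 1 mixes the vectors element-by-element (per semantic slot), and a fully connected layer with a nonlinearity then mixes information across heterogeneous slots. The shapes and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SemanticIndependentEncoder(nn.Module):
    def __init__(self, num_vectors=2, state_dim=6, hidden=64, out_dim=64):
        super().__init__()
        # Stacked state vectors act as channels; kernel_size=1 means each
        # semantic element (x, y, ...) is encoded independently of the others.
        self.per_element = nn.Conv1d(num_vectors, hidden, kernel_size=1)
        # The fully connected layer mixes heterogeneous semantic elements.
        self.cross_element = nn.Sequential(
            nn.Linear(hidden * state_dim, out_dim), nn.ReLU())

    def forward(self, states):                    # (batch, num_vectors, state_dim)
        h = torch.relu(self.per_element(states))  # (batch, hidden, state_dim)
        return self.cross_element(h.flatten(1))   # (batch, out_dim)

enc = SemanticIndependentEncoder()
x = torch.randn(8, 2, 6)   # e.g. predicted state and innovation per track
print(enc(x).shape)        # torch.Size([8, 64])
```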

Authors:Ziling Liu, Ziwei Chen, Mingqi Gao, Jinyu Yang, Feng Zheng
Title: Leveraging Geometric Priors for Unaligned Scene Change Detection
Abstract:
Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we introduce geometric priors for the first time to address the core challenges of unaligned SCD, for reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.
中文: 本文首次将几何先验引入非对齐场景变化检测,以解决视角变化和遮挡问题,提出了一种无需训练的框架,将该先验与视觉基础模型结合,在多个数据集上实现了优越且鲁棒的检测性能。
English: This paper introduces geometric priors into unaligned scene change detection to address viewpoint variations and occlusion challenges, proposing a training-free framework that integrates these priors with a visual foundation model for robust performance across multiple datasets.

Authors:Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Jun Gao, Congxuan Zhang, Xiaojuan Qi, Bing Li, Weiming Hu
Title: Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations
Abstract:
Large Vision-Language Models (LVLMs) suffer from serious hallucination problems, where the model-generated responses are inconsistent with the visual inputs. Existing hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection, which increase costs and limit sustainable improvement. To tackle these challenges, we propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels. During the self-injection process, the dis-preferred response is generated based on three key observations of hallucinations, ensuring it simulates real hallucination patterns. This fidelity offers an accurate learning signal for hallucination mitigation. Moreover, APASI incorporates an iterative alignment training strategy combined with curriculum learning to periodically update the preference data with increasing challenge, enabling stable and continuous enhancement of the LVLM. Extensive experiments across six benchmarks show that APASI not only effectively mitigates hallucinations for three baseline models but also achieves comparable or even superior performance to alignment-based methods with external dependency, thereby demonstrating its effectiveness and generalization capability. The code is available at https://github.com/davidluciolu/APASI.
中文: APASI是一种创新的自主偏好对齐方法,通过自我注入模拟幻觉并采用迭代训练,无需外部依赖即可有效减少大型视觉语言模型的幻觉问题,且性能媲美依赖外部资源的方法。
English: APASI is a novel autonomous preference alignment method that mitigates hallucinations in Large Vision-Language Models by self-injecting simulated hallucinations and using iterative training, achieving competitive performance without external dependencies.

Authors:Kerun Mi, Guoliang Kang, Guangyu Li, Lin Zhao, Tao Zhou, Chen Gong
Title: Cross-Domain Attribute Alignment with CLIP: A Rehearsal-Free Approach for Class-Incremental Unsupervised Domain Adaptation
Abstract:
Class-Incremental Unsupervised Domain Adaptation (CI-UDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where the sets of potential target classes appearing at different time steps are disjoint and are subsets of the source classes. The key to solving this problem lies in avoiding catastrophic forgetting of knowledge about previous target classes during continuously mitigating the domain shift. Most previous works cumbersomely combine two technical components. On one hand, they need to store and utilize rehearsal target sample from previous time steps to avoid catastrophic forgetting; on the other hand, they perform alignment only between classes shared across domains at each time step. Consequently, the memory will continuously increase and the asymmetric alignment may inevitably result in knowledge forgetting. In this paper, we propose to mine and preserve domain-invariant and class-agnostic knowledge to facilitate the CI-UDA task. Specifically, via using CLIP, we extract the class-agnostic properties which we name as "attribute". In our framework, we learn a "key-value" pair to represent an attribute, where the key corresponds to the visual prototype and the value is the textual prompt. We maintain two attribute dictionaries, each corresponding to a different domain. Then we perform attribute alignment across domains to mitigate the domain shift, via encouraging visual attention consistency and prediction consistency. Through attribute modeling and cross-domain alignment, we effectively reduce catastrophic knowledge forgetting while mitigating the domain shift, in a rehearsal-free way. Experiments on three CI-UDA benchmarks demonstrate that our method outperforms previous state-of-the-art methods and effectively alleviates catastrophic forgetting. Code is available at https://github.com/RyunMi/VisTA.
中文: 针对类别增量无监督领域自适应(CI-UDA)问题,本方法通过CLIP提取领域无关的"属性"特征,采用键值对表示并进行跨领域对齐,在无需记忆样本的情况下同步解决领域偏移和灾难性遗忘问题,实验证明其性能优于现有最优方法。
English: The proposed method for Class-Incremental Unsupervised Domain Adaptation (CI-UDA) leverages CLIP to extract domain-invariant attributes represented as key-value pairs, enabling rehearsal-free cross-domain alignment that effectively mitigates both domain shift and catastrophic forgetting while outperforming prior approaches.
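Note (toy rendering of the key-value attribute idea, not the paper's objective or the released VisTA code): keys are modeled as learnable visual prototypes and values as prompt embeddings (e.g. frozen CLIP text features); retrieval picks the closest prototype to an image feature, and a simplified alignment loss pulls the source- and target-domain dictionaries toward each other. All names and the loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeDictionary(nn.Module):
    def __init__(self, num_attributes=16, dim=512):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_attributes, dim))    # visual prototypes
        self.values = nn.Parameter(torch.randn(num_attributes, dim))  # prompt embeddings

    def retrieve(self, feats):                        # feats: (batch, dim)
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.keys, dim=-1).T
        idx = sim.argmax(dim=-1)                      # nearest attribute per image
        return idx, self.values[idx]

def attribute_alignment_loss(src_dict, tgt_dict):
    """Toy cross-domain alignment: matched attributes should stay close."""
    return F.mse_loss(F.normalize(src_dict.keys, dim=-1),
                      F.normalize(tgt_dict.keys, dim=-1))

src, tgt = AttributeDictionary(), AttributeDictionary()
image_feats = torch.randn(4, 512)                     # e.g. CLIP image features
idx, prompts = tgt.retrieve(image_feats)
loss = attribute_alignment_loss(src, tgt)
```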

Authors:Yitong Zhang, Ximo Li, Liyi Cai, Jia Li
Title: Realistic Environmental Injection Attacks on GUI Agents
Abstract:
GUI agents built on LVLMs are increasingly used to interact with websites. However, their exposure to open-world content makes them vulnerable to Environmental Injection Attacks (EIAs) that hijack agent behavior via webpage elements. Many recent studies assume the attacker to be a regular user who can only upload a single trigger image, which is more realistic than earlier assumptions of website-level administrative control. However, these works still fall short of realism: (1) the trigger's position and surrounding context remain largely fixed between training and testing, failing to capture the dynamic nature of real webpages and (2) the trigger often occupies an unrealistically large area, whereas real-world images are typically small. To better reflect real-world scenarios, we introduce a more realistic threat model where the attacker is a regular user and the trigger image is small and embedded within a dynamically changing environment. As a result, existing attacks prove largely ineffective under this threat model. To better expose the vulnerabilities of GUI agents, we propose Chameleon, an attack framework with two main novelties. The first is LLM-Driven Environment Simulation, which automatically generates diverse and high-fidelity webpage simulations. The second is Attention Black Hole, which transforms attention weights into explicit supervisory signals that guide the agent's focus toward the trigger region. We evaluate Chameleon on 6 realistic websites and 4 representative LVLM-powered GUI agents, where it significantly outperforms existing methods. Ablation studies confirm that both novelties are critical to performance. Our findings reveal underexplored vulnerabilities in modern GUI agents and establish a robust foundation for future research on defense in open-world GUI agent systems. The code is publicly available at https://github.com/zhangyitonggg/attack2gui.
中文: 基于LVLM的GUI代理易受环境注入攻击,本研究提出Chameleon攻击框架,通过模拟动态网页环境和引导代理关注小尺寸触发器,在现实威胁模型下显著优于现有方法。
English: GUI agents based on LVLMs are vulnerable to Environmental Injection Attacks, and this study introduces Chameleon, a novel attack framework that simulates dynamic web environments and directs agent attention to small triggers, significantly outperforming existing methods under realistic threat models.

Authors:Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu N. Duong
Title: ANROT-HELANet: Adverserially and Naturally Robust Attention-Based Aggregation Network via The Hellinger Distance for Few-Shot Classification
Abstract:
Few-Shot Learning (FSL), which involves learning to generalize using only a few data samples, has demonstrated promising performance superior to ordinary CNN methods. While Bayesian-based estimation approaches using Kullback-Leibler (KL) divergence have shown improvements, they remain vulnerable to adversarial attacks and natural noises. We introduce ANROT-HELANet, an Adversarially and Naturally RObusT Hellinger Aggregation Network that significantly advances the state-of-the-art in FSL robustness and performance. Our approach implements an adversarially and naturally robust Hellinger distance-based feature class aggregation scheme, demonstrating resilience to adversarial perturbations up to ε = 0.30 and Gaussian noise up to σ = 0.30. The network achieves substantial improvements across benchmark datasets, including gains of 1.20% and 1.40% for 1-shot and 5-shot scenarios on miniImageNet respectively. We introduce a novel Hellinger Similarity contrastive loss function that generalizes the cosine similarity contrastive loss for variational few-shot inference scenarios. Our approach also achieves superior image reconstruction quality with a FID score of 2.75, outperforming traditional VAE (3.43) and WAE (3.38) approaches. Extensive experiments conducted on four few-shot benchmark datasets verify that ANROT-HELANet's combination of Hellinger distance-based feature aggregation, attention mechanisms, and our novel loss function establishes new state-of-the-art performance while maintaining robustness against both adversarial and natural perturbations. Our code repository will be available at https://github.com/GreedYLearner1146/ANROT-HELANet/tree/main.
中文: ANROT-HELANet提出了一种基于Hellinger距离的鲁棒聚合网络,显著提升了小样本学习的性能和对抗自然干扰与对抗性攻击的鲁棒性,在多个基准测试中取得了最先进的成果。
English: ANROT-HELANet introduces a robust Hellinger distance-based aggregation network that significantly enhances few-shot learning performance and resilience against adversarial and natural perturbations, achieving state-of-the-art results across multiple benchmarks.
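Note (hedged numerical sketch, not the authors' implementation): for diagonal Gaussian feature distributions the squared Hellinger distance has a closed form via the Bhattacharyya coefficient, and 1 - H^2 yields a bounded similarity that can stand in for cosine similarity in a contrastive loss over variational (mean, variance) embeddings. Variable names and dimensions are illustrative.

```python
import numpy as np

def hellinger_sq_diag_gauss(mu1, var1, mu2, var2):
    """Squared Hellinger distance between N(mu1, diag(var1)) and N(mu2, diag(var2))."""
    avg = (var1 + var2) / 2.0
    bc = np.prod((var1 * var2) ** 0.25 / np.sqrt(avg)) \
         * np.exp(-0.125 * np.sum((mu1 - mu2) ** 2 / avg))   # Bhattacharyya coefficient
    return 1.0 - bc

def hellinger_similarity(mu1, var1, mu2, var2):
    return 1.0 - hellinger_sq_diag_gauss(mu1, var1, mu2, var2)   # in [0, 1]

mu_q, var_q = np.zeros(8), np.ones(8)                 # query class posterior
mu_p, var_p = 0.1 * np.ones(8), 1.2 * np.ones(8)      # prototype posterior
print(hellinger_similarity(mu_q, var_q, mu_p, var_p))
```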

Authors:Yihang She, Andrew Blake, David Coomes, Srinivasan Keshav
Title: Scaling Up Forest Vision with Synthetic Data
Abstract:
Accurate tree segmentation is a key step in extracting individual tree metrics from forest laser scans, and is essential to understanding ecosystem functions in carbon cycling and beyond. Over the past decade, tree segmentation algorithms have advanced rapidly due to developments in AI. However, existing public 3D forest datasets are not large enough to build robust tree segmentation systems. Motivated by the success of synthetic data in other domains such as self-driving, we investigate whether similar approaches can help with tree segmentation. In place of expensive field data collection and annotation, we use synthetic data during pretraining, and then require only minimal, real forest plot annotation for fine-tuning. We have developed a new synthetic data generation pipeline to do this for forest vision tasks, integrating advances in game-engines with physics-based LiDAR simulation. As a result, we have produced a comprehensive, diverse, annotated 3D forest dataset on an unprecedented scale. Extensive experiments with a state-of-the-art tree segmentation algorithm and a popular real dataset show that our synthetic data can substantially reduce the need for labelled real data. After fine-tuning on just a single, real, forest plot of less than 0.1 hectare, the pretrained model achieves segmentations that are competitive with a model trained on the full scale real data. We have also identified critical factors for successful use of synthetic data: physics, diversity, and scale, paving the way for more robust 3D forest vision systems in the future. Our data generation pipeline and the resulting dataset are available at https://github.com/yihshe/CAMP3D.git.
Chinese: 本研究开发了一种合成数据生成流程,通过使用合成数据进行预训练并结合极少量的真实数据进行微调,显著降低了树木分割对标注真实数据的需求,仅需0.1公顷真实林地标注即可达到与全量真实数据训练相媲美的分割效果。
English: This study introduces a synthetic data generation pipeline that significantly reduces the need for labeled real data in tree segmentation by using synthetic data for pretraining and minimal real data for fine-tuning, achieving competitive results with only 0.1 hectare of real forest plot annotation.

Authors:Chengde Lin, Xuezhu Gong, Shuxue Ding, Mingzhe Yang, Xijun Lu, Chengjun Mo
Title: StegOT: Trade-offs in Steganography via Optimal Transport
Abstract:
Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on generative adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.
中文: 本文提出StegOT模型,一种基于自编码器并结合最优传输理论的隐写方法,通过多通道最优传输模块平衡载体与秘密图像的信息,提升隐写和恢复图像的质量。
English: This paper introduces StegOT, an autoencoder-based steganography model that uses optimal transport theory to balance information between cover and secret images, improving both stego and recovery image quality.

Authors:Zhiwen Yang, Yuxin Peng
Title: SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion
Abstract:
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.
中文摘要:SPHERE框架通过融合体素与高斯表示,结合语义引导和物理感知建模,在自动驾驶场景中实现了更优的几何细节与语义精度的3D语义场景补全。
English Summary: The proposed SPHERE framework integrates voxel and Gaussian representations to enhance camera-based 3D Semantic Scene Completion by combining semantic guidance with physical-aware modeling, achieving superior geometric details and semantic accuracy in autonomous driving scenes.

Authors:Zheng Li, Pei Qu, Yufei Jia, Shihui Zhou, Haizhou Ge, Jiahang Cao, Jinni Zhou, Guyue Zhou, Jun Ma
Title: ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations
Abstract:
Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted--an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 44.7% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments. Our project website can be found in https://zheng-joe-lee.github.io/manivid3d/.

Authors:Yuqiu Liu, Jialin Song, Manolis Savva, Wuyang Chen
Title: WildSmoke: Ready-to-Use Dynamic 3D Smoke Assets from a Single Video in the Wild
Abstract:
We propose a pipeline to extract and reconstruct dynamic 3D smoke assets from a single in-the-wild video, and further integrate interactive simulation for smoke design and editing. Recent developments in 3D vision have significantly improved reconstructing and rendering fluid dynamics, supporting realistic and temporally consistent view synthesis. However, current fluid reconstructions rely heavily on carefully controlled clean lab environments, whereas real-world videos captured in the wild are largely underexplored. We pinpoint three key challenges of reconstructing smoke in real-world videos and design targeted techniques, including smoke extraction with background removal, initialization of smoke particles and camera poses, and inferring multi-view videos. Our method not only outperforms previous reconstruction and generation methods with high-quality smoke reconstructions (+2.22 average PSNR on wild videos), but also enables diverse and realistic editing of fluid dynamics by simulating our smoke assets. We provide our models, data, and 4D smoke assets at [https://autumnyq.github.io/WildSmoke](https://autumnyq.github.io/WildSmoke).

Authors:Zhi Chen, Le Zhang
Title: UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction
Abstract:
Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks, and the resulting models require substantial computational overhead. To address this, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and codes are available at https://github.com/yyxl123/UltraUPConvNet
中文: UltraUPConvNet是一种计算效率高的通用框架,在包含七个解剖区域超过9700个标注的大规模数据集上训练,能以较低计算开销在超声图像分类和分割任务中实现最先进的性能。
English: UltraUPConvNet is a computationally efficient universal framework that achieves state-of-the-art performance in both ultrasound image classification and segmentation with reduced computational overhead, trained on a large-scale dataset of over 9,700 annotations across seven anatomical regions.

Authors:Chao Chen, Shunyu Yao, Yuanwu He, Tao Feng, Ruojing Song, Yuliang Guo, Xinyu Huang, Chenxu Wu, Ren Liu, Chen Feng
Title: End-to-End Visual Autonomous Parking via Control-Aided Attention
Abstract:
Precise parking requires an end-to-end system where perception adaptively provides policy-relevant details-especially in critical areas where fine control decisions are essential. End-to-end learning offers a unified framework by directly mapping sensor inputs to control actions, but existing approaches lack effective synergy between perception and control. We find that transformer-based self-attention, when used alone, tends to produce unstable and temporally inconsistent spatial attention, which undermines the reliability of downstream policy decisions over time. Instead, we propose CAA-Policy, an end-to-end imitation learning system that allows the control signal to guide the learning of visual attention via a novel Control-Aided Attention (CAA) mechanism. For the first time, we train such an attention module in a self-supervised manner, using backpropagated gradients from the control outputs instead of from the training loss. This strategy encourages the attention to focus on visual features that induce high variance in action outputs, rather than merely minimizing the training loss-a shift we demonstrate leads to a more robust and generalizable policy. To further enhance stability, CAA-Policy integrates short-horizon waypoint prediction as an auxiliary task, and introduces a separately trained motion prediction module to robustly track the target spot over time. Extensive experiments in the CARLA simulator show that CAA-Policy consistently surpasses both the end-to-end learning baseline and the modular BEV segmentation + hybrid A* pipeline, achieving superior accuracy, robustness, and interpretability. Code is released at https://github.com/Joechencc/CAAPolicy.
Chinese: 提出的CAA-Policy通过控制辅助注意力机制,以自监督方式利用控制信号引导视觉注意力,显著提升了停车策略的鲁棒性,并在精度和稳定性上超越了现有方法。
English: The proposed CAA-Policy introduces a Control-Aided Attention mechanism that uses control signals to guide visual attention in a self-supervised manner, enhancing parking policy robustness and outperforming existing methods in accuracy and stability.

Authors:Xiaoyu Huang, Lauren M Maxson, Trang Nguyen, Cheng Jack Song, Yuankai Huo
Title: Organoid Tracker: A SAM2-Powered Platform for Zero-shot Cyst Analysis in Human Kidney Organoid Videos
Abstract:
Recent advances in organoid models have revolutionized the study of human kidney disease mechanisms and drug discovery by enabling scalable, cost-effective research without the need for animal sacrifice. Here, we present a kidney organoid platform optimized for efficient screening in polycystic kidney disease (PKD). While these systems generate rich spatial-temporal microscopy video datasets, current manual approaches to analysis remain limited to coarse classifications (e.g., hit vs. non-hit), often missing valuable pixel-level and longitudinal information. To help overcome this bottleneck, we developed Organoid Tracker, a graphical user interface (GUI) platform designed with a modular plugin architecture, which empowers researchers to extract detailed, quantitative metrics without programming expertise. Built on the cutting-edge vision foundation model Segment Anything Model 2 (SAM2), Organoid Tracker enables zero-shot segmentation and automated analysis of spatial-temporal microscopy videos. It quantifies key metrics such as cyst formation rate, growth velocity, and morphological changes, while generating comprehensive reports. By providing an extensible, open-source framework, Organoid Tracker offers a powerful solution for improving and accelerating research in kidney development, PKD modeling, and therapeutic discovery. The platform is publicly available as open-source software at https://github.com/hrlblab/OrganoidTracker.
Chinese: Organoid Tracker平台基于Segment Anything Model 2构建,能够对肾脏类器官显微镜视频进行自动化零样本分割和定量分析,为多囊肾病研究与药物发现提供高效工具。
English: The Organoid Tracker platform, built on the Segment Anything Model 2, enables automated, zero-shot segmentation and quantitative analysis of kidney organoid microscopy videos to accelerate research in polycystic kidney disease and drug discovery.
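Note (hedged sketch, not the released Organoid Tracker plugin API): once SAM2-style masks are available per frame, cyst metrics such as area over time and growth velocity reduce to simple measurements on the boolean masks. The pixel size and frame interval below are assumed calibration values.

```python
import numpy as np

def cyst_metrics(masks, um_per_px=1.5, minutes_per_frame=30.0):
    """masks: (T, H, W) boolean array of cyst segmentations over time."""
    areas_um2 = masks.reshape(len(masks), -1).sum(axis=1) * um_per_px ** 2
    growth = np.diff(areas_um2) / (minutes_per_frame / 60.0)   # um^2 per hour
    return {"area_um2": areas_um2, "growth_um2_per_h": growth,
            "mean_growth_um2_per_h": float(growth.mean()) if len(growth) else 0.0}

masks = np.zeros((4, 128, 128), dtype=bool)
for t in range(4):                      # a toy cyst growing over four frames
    masks[t, 40:60 + 5 * t, 40:60 + 5 * t] = True
print(cyst_metrics(masks)["mean_growth_um2_per_h"])
```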

Authors:Gurutva Patle, Nilay Girgaonkar, Nagabhushan Somraj, Rajiv Soundararajan
Title: AD-GS: Alternating Densification for Sparse-Input 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has shown impressive results in real-time novel view synthesis. However, it often struggles under sparse-view settings, producing undesirable artifacts such as floaters, inaccurate geometry, and overfitting due to limited observations. We find that a key contributing factor is uncontrolled densification, where adding Gaussian primitives rapidly without guidance can harm geometry and cause artifacts. We propose AD-GS, a novel alternating densification framework that interleaves high and low densification phases. During high densification, the model densifies aggressively, followed by photometric loss based training to capture fine-grained scene details. Low densification then primarily involves aggressive opacity pruning of Gaussians followed by regularizing their geometry through pseudo-view consistency and edge-aware depth smoothness. This alternating approach helps reduce overfitting by carefully controlling model capacity growth while progressively refining the scene representation. Extensive experiments on challenging datasets demonstrate that AD-GS significantly improves rendering quality and geometric consistency compared to existing methods. The source code for our model can be found on our project page: https://gurutvapatle.github.io/publications/2025/ADGS.html .

Authors:Ali Hedayatnia, Mostafa Tavassolipour, Babak Nadjar Araabi, Abdol-Hossein Vahabie
Title: Robustifying Diffusion-Denoised Smoothing Against Covariate Shift
Abstract:
Randomized smoothing is a well-established method for achieving certified robustness against l2-adversarial perturbations. By incorporating a denoiser before the base classifier, pretrained classifiers can be seamlessly integrated into randomized smoothing without significant performance degradation. Among existing methods, Diffusion Denoised Smoothing - where a pretrained denoising diffusion model serves as the denoiser - has produced state-of-the-art results. However, we show that employing a denoising diffusion model introduces a covariate shift via misestimation of the added noise, ultimately degrading the smoothed classifier's performance. To address this issue, we propose a novel adversarial objective function focused on the added noise of the denoising diffusion model. This approach is inspired by our understanding of the origin of the covariate shift. Our goal is to train the base classifier to ensure it is robust against the covariate shift introduced by the denoiser. Our method significantly improves certified accuracy across three standard classification benchmarks - MNIST, CIFAR-10, and ImageNet - achieving new state-of-the-art performance in l2-adversarial perturbations. Our implementation is publicly available at https://github.com/ahedayat/Robustifying-DDS-Against-Covariate-Shift
中文摘要:随机平滑是一种增强分类器对抗性鲁棒性的方法,本研究通过提出针对去噪扩散模型噪声估计偏差的新型对抗目标,有效解决了协变量偏移问题,在多个数据集上实现了最优认证精度。
English Summary: Randomized smoothing enhances classifier robustness against adversarial attacks, and this study introduces a novel adversarial objective to correct covariate shift in denoising diffusion models, achieving state-of-the-art certified accuracy on multiple benchmarks.
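Note (hedged sketch of the denoised-smoothing prediction step, a simplified view of the pipeline described above and not the released code): add Gaussian noise to the input, pass each noisy copy through a denoiser, classify, and take a majority vote. The `denoiser` and `classifier` modules are placeholders.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smoothed_predict(x, denoiser, classifier, sigma=0.25, n_samples=100):
    """x: (C, H, W). Returns the majority-vote class of the smoothed classifier."""
    votes = torch.zeros(classifier.num_classes, dtype=torch.long)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)     # perturb with N(0, sigma^2 I)
        denoised = denoiser(noisy.unsqueeze(0))     # estimate the clean image
        pred = classifier(denoised).argmax(dim=-1)
        votes[pred.item()] += 1
    return votes.argmax().item()

# Placeholder modules standing in for a diffusion denoiser and a base classifier.
class Identity(nn.Module):
    def forward(self, x): return x

class TinyClassifier(nn.Module):
    num_classes = 10
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3 * 32 * 32, self.num_classes)
    def forward(self, x): return self.head(x.flatten(1))

x = torch.randn(3, 32, 32)
print(smoothed_predict(x, Identity(), TinyClassifier()))
```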

Authors:Weiqiang Zhao, Tianzhu Liu, Yuzhe Gui, Yanfeng Gu
Title: Total Variation Subgradient Guided Image Fusion for Dual-Camera CASSI System
Abstract:
Spectral imaging technology has long faced fundamental challenges in balancing spectral, spatial, and temporal resolutions. While compressive sensing-based Coded Aperture Snapshot Spectral Imaging (CASSI) mitigates this trade-off through optical encoding, high compression ratios result in ill-posed reconstruction problems. Traditional model-based methods exhibit limited performance due to reliance on handcrafted inherent image priors, while deep learning approaches are constrained by their black-box nature, which compromises physical interpretability. To address these limitations, we propose a dual-camera CASSI reconstruction framework that integrates total variation (TV) subgradient theory. By establishing an end-to-end SD-CASSI mathematical model, we reduce the computational complexity of solving the inverse problem and provide a mathematically well-founded framework for analyzing multi-camera systems. A dynamic regularization strategy is introduced, incorporating normalized gradient constraints from RGB/panchromatic-derived reference images, which constructs a TV subgradient similarity function with strict convex optimization guarantees. Leveraging spatial priors from auxiliary cameras, an adaptive reference generation and updating mechanism is designed to provide subgradient guidance. Experimental results demonstrate that the proposed method effectively preserves spatial-spectral structural consistency. The theoretical framework establishes an interpretable mathematical foundation for computational spectral imaging, demonstrating robust performance across diverse reconstruction scenarios. The source code is available at https://github.com/bestwishes43/ADMM-TVDS.
中文摘要:该研究提出的双相机CASSI框架通过结合全变分次梯度理论与动态正则化策略,在保持光谱空间结构一致性的同时,为计算光谱成像建立了可解释的数学基础,有效解决了高压缩比下的病态重建问题。
English Summary: The proposed dual-camera CASSI framework integrates total variation subgradient theory with dynamic regularization to overcome reconstruction limitations, achieving enhanced spatial-spectral consistency while providing mathematical interpretability for computational spectral imaging.
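Note (hedged sketch, not the paper's solver): a subgradient of anisotropic total variation, TV(x) = sum |D_h x| + |D_v x|, can be computed as the adjoint of the finite-difference operator applied to sign(Dx). Such a subgradient is the basic ingredient a TV-subgradient-guided regularizer relies on; the boundary handling below is one common convention.

```python
import numpy as np

def grad(x):
    """Forward differences with zero boundary."""
    dh = np.zeros_like(x); dh[:, :-1] = x[:, 1:] - x[:, :-1]
    dv = np.zeros_like(x); dv[:-1, :] = x[1:, :] - x[:-1, :]
    return dh, dv

def div(dh, dv):
    """Discrete divergence, the negative adjoint of grad."""
    out = np.zeros_like(dh)
    out[:, 0] += dh[:, 0];  out[:, 1:] += dh[:, 1:] - dh[:, :-1]
    out[0, :] += dv[0, :];  out[1:, :] += dv[1:, :] - dv[:-1, :]
    return out

def tv_subgradient(x):
    dh, dv = grad(x)
    return -div(np.sign(dh), np.sign(dv))   # an element of the TV subdifferential

img = np.random.rand(64, 64)
g = tv_subgradient(img)
img_step = img - 0.01 * g                   # one subgradient descent step on TV
```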

Authors:Aryan Kashyap Naveen, Bhuvanesh Singla, Raajan Wankhade, Shreesha M, Ramu S, Ram Mohana Reddy Guddeti
Title: AutoOEP -- A Multi-modal Framework for Online Exam Proctoring
Abstract:
The burgeoning of online education has created an urgent need for robust and scalable systems to ensure academic integrity during remote examinations. Traditional human proctoring is often not feasible at scale, while existing automated solutions can be intrusive or fail to detect a wide range of cheating behaviors. This paper introduces AutoOEP (Automated Online Exam Proctoring), a comprehensive, multi-modal framework that leverages computer vision and machine learning to provide effective, automated proctoring. The system utilizes a dual-camera setup to capture both a frontal view of the examinee and a side view of the workspace, minimizing blind spots. Our approach integrates several parallel analyses: the Face Module performs continuous identity verification using ArcFace, along with head pose estimation, gaze tracking, and mouth movement analysis to detect suspicious cues. Concurrently, the Hand Module employs a fine-tuned YOLOv11 model for detecting prohibited items (e.g., mobile phones, notes) and tracks hand proximity to these objects. Features from these modules are aggregated and fed into a Long Short-Term Memory (LSTM) network that analyzes temporal patterns to calculate a real-time cheating probability score. We evaluate AutoOEP on a custom-collected dataset simulating diverse exam conditions. Our system achieves an accuracy of 90.7% in classifying suspicious activities. The object detection component obtains a mean Average Precision (mAP@.5) of 0.57 for prohibited items, and the entire framework processes video streams at approximately 2.4 frames per second without a GPU. The results demonstrate that AutoOEP is an effective and resource-efficient solution for automated proctoring, significantly reducing the need for human intervention and enhancing the integrity of online assessments. The code is public and can be accessed at https://github.com/05kashyap/AutoOEP.
中文摘要:本文提出AutoOEP系统,通过计算机视觉和机器学习技术,结合双摄像头进行面部行为分析和违禁物品检测,实现了90.7%的在线考试作弊行为识别准确率,为远程监考提供了高效解决方案。
English Summary: This paper introduces AutoOEP, a multi-modal automated proctoring system using computer vision and machine learning to detect cheating behaviors through facial analysis and object detection, achieving 90.7% accuracy in identifying suspicious activities during online exams.
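Note (hedged sketch of the temporal scoring stage, illustrative and not the released AutoOEP code): per-frame features aggregated from the face and hand modules are fed to an LSTM whose output is squashed to a real-time cheating probability. The feature contents and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalCheatScorer(nn.Module):
    def __init__(self, feat_dim=16, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frame_feats):                  # (batch, time, feat_dim)
        out, _ = self.lstm(frame_feats)
        return torch.sigmoid(self.head(out[:, -1]))  # probability for latest frame

# Example features per frame: gaze angles, head pose, mouth activity,
# phone/notes detections, hand-object proximity, etc.
window = torch.randn(1, 30, 16)                      # 30-frame sliding window
print(TemporalCheatScorer()(window).item())
```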

Authors:Qingxiang Liu, Ting Huang, Zeyu Zhang, Hao Tang
Title: Nav-R1: Reasoning and Navigation in Embodied Scenes
Abstract:
Embodied navigation requires agents to integrate perception, reasoning, and action for robust interaction in complex 3D environments. Existing approaches often suffer from incoherent and unstable reasoning traces that hinder generalization across diverse environments, and difficulty balancing long-horizon semantic reasoning with low-latency control for real-time navigation. To address these challenges, we propose Nav-R1, an embodied foundation model that unifies reasoning in embodied environments. We first construct Nav-CoT-110K, a large-scale dataset of step-by-step Chains-of-Thought (CoT) for embodied tasks, which enables cold-start initialization with structured reasoning. Building on this foundation, we design a GRPO-based reinforcement learning framework with three complementary rewards: format, understanding, and navigation, to improve structural adherence, semantic grounding, and path fidelity. Furthermore, we introduce a Fast-in-Slow reasoning paradigm, decoupling deliberate semantic reasoning from low-latency reactive control for efficient yet coherent navigation. Extensive evaluations on embodied AI benchmarks demonstrate that Nav-R1 consistently outperforms strong baselines, with over 8% average improvement in reasoning and navigation performance. Real-world deployment on a mobile robot further validates its robustness under limited onboard resources. Code: https://github.com/AIGeeksGroup/Nav-R1. Website: https://aigeeksgroup.github.io/Nav-R1.
中文:Nav-R1模型通过构建大规模思维链数据集和强化学习框架,解决了具身导航中的推理不连贯问题,在仿真测试和真实机器人部署中均实现了显著的性能提升。
English: The Nav-R1 model addresses challenges in embodied navigation by unifying reasoning through a large-scale dataset and a reinforcement learning framework, achieving significant performance improvements in both simulated benchmarks and real-world robot deployment.

Authors:Simone Mosco, Daniel Fusaro, Wanmeng Li, Emanuele Menegatti, Alberto Pretto
Title: Point-Plane Projections for Accurate LiDAR Semantic Segmentation in Small Data Scenarios
Abstract:
LiDAR point cloud semantic segmentation is essential for interpreting 3D environments in applications such as autonomous driving and robotics. Recent methods achieve strong performance by exploiting different point cloud representations or incorporating data from other sensors, such as cameras or external datasets. However, these approaches often suffer from high computational complexity and require large amounts of training data, limiting their generalization in data-scarce scenarios. In this paper, we improve the performance of point-based methods by effectively learning features from 2D representations through point-plane projections, enabling the extraction of complementary information while relying solely on LiDAR data. Additionally, we introduce a geometry-aware technique for data augmentation that aligns with LiDAR sensor properties and mitigates class imbalance. We implemented and evaluated our method that applies point-plane projections onto multiple informative 2D representations of the point cloud. Experiments demonstrate that this approach leads to significant improvements in limited-data scenarios, while also achieving competitive results on two publicly available standard datasets, namely SemanticKITTI and PandaSet. The code of our method is available at https://github.com/SiMoM0/3PNet
中文摘要:本文通过点云平面投影提取二维特征并结合几何感知数据增强技术,有效提升了激光雷达点云语义分割在有限数据场景下的性能,并在公开数据集上取得了具有竞争力的结果。
English summary: This paper enhances LiDAR point cloud semantic segmentation by using point-plane projections to extract 2D features and introducing geometry-aware data augmentation, achieving strong performance in data-limited scenarios and competitive results on standard datasets.
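Note (hedged sketch of a point-to-plane projection; the paper projects onto multiple informative planes, only a bird's-eye-view XY grid is shown here, and the grid size and extent are assumptions): each point is binned into a 2D cell so that features computed on the grid can later be gathered back to the originating points.

```python
import numpy as np

def project_to_plane(points, grid_size=256, extent=50.0):
    """points: (N, 3) LiDAR xyz. Returns per-point cell indices and a height map."""
    ij = np.clip(((points[:, :2] + extent) / (2 * extent) * grid_size).astype(int),
                 0, grid_size - 1)                            # (N, 2) cell indices
    height = np.full((grid_size, grid_size), -np.inf, dtype=np.float32)
    np.maximum.at(height, (ij[:, 0], ij[:, 1]), points[:, 2]) # max-z per cell
    height[np.isinf(height)] = 0.0
    return ij, height

pts = np.random.uniform(-50, 50, size=(10000, 3)).astype(np.float32)
cell_idx, bev_height = project_to_plane(pts)
point_feats = bev_height[cell_idx[:, 0], cell_idx[:, 1]]      # gather 2D features back
```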

Authors:Xiaoyang Ma, Yiyang Chai, Xinran Qu, Hong Sun
Title: USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction
Abstract:
Reconstructing hyperspectral images (HSIs) from a single RGB image is ill-posed and can become physically inconsistent when the camera spectral sensitivity (CSS) and scene illumination are misspecified. We formulate RGB-to-HSI reconstruction as a physics-grounded inverse problem regularized by a nuclear norm in a learnable transform domain, and we explicitly estimate CSS and illumination to define the forward operator embedded in each iteration, ensuring colorimetric consistency. To avoid the cost and instability of full singular-value decompositions (SVDs) required by singular-value thresholding (SVT), we introduce a data-adaptive low-rank subspace SVT operator. Building on these components, we develop USCTNet, a deep unfolding solver tailored to HSI that couples a parameter estimation module with learnable proximal updates. Extensive experiments on standard benchmarks show consistent improvements over state-of-the-art RGB-based methods in reconstruction accuracy. Code: https://github.com/psykheXX/USCTNet-Code-Implementation.git
Chinese Summary: 本研究提出USCTNet,一种通过将物理建模与可学习邻近算子相结合,从RGB图像重建高光谱图像的深度展开网络,在多个基准测试中展现出优于现有方法的精度。
English Summary: The study introduces USCTNet, a deep unfolding network that reconstructs hyperspectral images from RGB inputs by integrating physics-based modeling with learnable proximal operators, achieving superior accuracy over existing methods.
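Note (hedged sketch of singular-value thresholding restricted to a low-rank subspace; the paper learns a data-adaptive subspace, whereas here it is obtained with a simple randomized range finder, so this is illustrative only): SVT is the proximal operator of the nuclear norm, i.e. shrink the singular values by tau and reconstruct.

```python
import numpy as np

def low_rank_svt(X, tau, rank=16, n_oversample=8, seed=0):
    rng = np.random.default_rng(seed)
    # Randomized range finder: approximate the top singular subspace of X.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((X.shape[1], rank + n_oversample)))
    U_small, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    U = Q @ U_small
    s_shrunk = np.maximum(s - tau, 0.0)          # soft-threshold singular values
    return (U * s_shrunk) @ Vt                   # nuclear-norm prox of X (low rank)

# Example: a noisy low-rank matrix (e.g. an unfolded hyperspectral cube).
rng = np.random.default_rng(1)
X = rng.standard_normal((512, 5)) @ rng.standard_normal((5, 31))
X_denoised = low_rank_svt(X + 0.1 * rng.standard_normal(X.shape), tau=1.0, rank=10)
```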

Authors:Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
Title: Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses
Abstract:
3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer's disease. We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.
中文: 本研究提出了一种基于SimCLR的高分辨率自监督基础模型,用于3D脑部MRI分析,该模型在多种任务中表现优异,即使在有限标注数据下仍保持卓越性能,并已公开共享以促进临床应用。
English: This work introduces a high-resolution, self-supervised SimCLR foundation model for 3D brain MRI, pre-trained on diverse datasets, which outperforms other models across multiple tasks and maintains strong performance with limited labeled data.
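Note (hedged sketch of the SimCLR objective, NT-Xent, used for this kind of pretraining; the 3D encoder and augmentations are placeholders, only the contrastive loss over two augmented views is spelled out):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) projections of two augmented views of the same scans."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, dim)
    sim = z @ z.T / temperature                               # (2B, 2B)
    sim.fill_diagonal_(float("-inf"))                         # exclude self-pairs
    B = z1.shape[0]
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)                      # positive: the other view

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2).item())
```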

Authors:Prajit Sengupta, Islem Rekik
Title: FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification
Abstract:
Medical image classification requires not only high predictive performance but also interpretability to ensure clinical trust and adoption. Graph Neural Networks (GNNs) offer a powerful framework for modeling relational structures within datasets; however, standard GNNs often operate as black boxes, limiting transparency and usability, particularly in clinical settings. In this work, we present an interpretable graph-based learning framework named FireGNN that integrates trainable fuzzy rules into GNNs for medical image classification. These rules embed topological descriptors - node degree, clustering coefficient, and label agreement - using learnable thresholds and sharpness parameters to enable intrinsic symbolic reasoning. Additionally, we explore auxiliary self-supervised tasks (e.g., homophily prediction, similarity entropy) as a benchmark to evaluate the contribution of topological learning. Our fuzzy-rule-enhanced model achieves strong performance across five MedMNIST benchmarks and the synthetic dataset MorphoMNIST, while also generating interpretable rule-based explanations. To our knowledge, this is the first integration of trainable fuzzy rules within a GNN. Source Code: https://github.com/basiralab/FireGNN
中文摘要:FireGNN框架将可训练的模糊规则与图神经网络相结合,通过拓扑描述符实现符号推理,在提升医学图像分类性能的同时生成可解释的规则说明。
English Summary: The FireGNN framework integrates trainable fuzzy rules with Graph Neural Networks to enhance interpretability in medical image classification, achieving strong performance across multiple benchmarks while providing rule-based explanations.
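Note (hedged sketch of a trainable fuzzy rule over topological descriptors, illustrative and not the FireGNN release): each descriptor passes through a soft "greater-than-threshold" membership sigmoid((x - t) * s) with learnable threshold t and sharpness s; a rule's firing strength is the product of its memberships and can be mixed with the GNN logits. The descriptor values below are made up.

```python
import torch
import torch.nn as nn

class FuzzyRule(nn.Module):
    def __init__(self, num_descriptors=3):
        super().__init__()
        self.threshold = nn.Parameter(torch.zeros(num_descriptors))
        self.sharpness = nn.Parameter(torch.ones(num_descriptors))

    def forward(self, descriptors):                    # (num_nodes, num_descriptors)
        membership = torch.sigmoid((descriptors - self.threshold) * self.sharpness)
        return membership.prod(dim=-1)                 # rule firing strength per node

# Descriptors per node: [degree, clustering coefficient, label agreement]
desc = torch.tensor([[4.0, 0.5, 0.9], [1.0, 0.1, 0.2]])
rule = FuzzyRule()
print(rule(desc))                                      # differentiable w.r.t. t and s
```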

Authors:Christian Fane
Title: A Real-Time Diminished Reality Approach to Privacy in MR Collaboration
Abstract:
Diminished reality (DR) refers to the digital removal of real-world objects by compositing background content in their place. This thesis presents a real-time, inpainting-based DR system designed to enable privacy control in shared-space mixed reality (MR) meetings. The system allows a primary headset user to selectively remove personal or sensitive items from their environment, ensuring that those objects are no longer visible to other participants. Removal is achieved through semantic segmentation and precise object selection, followed by real-time inpainting from the viewpoint of a secondary observer, implemented using a mobile ZED 2i depth camera. The solution is designed to be portable and robust, requiring neither a fixed secondary viewpoint nor prior 3D scanning of the environment. The system utilises YOLOv11 for object detection and a modified Decoupled Spatial-Temporal Transformer (DSTT) model for high-quality video inpainting. At 720p resolution, the pipeline sustains frame rates exceeding 20 fps, demonstrating the feasibility of real-time diminished reality for practical privacy-preserving MR applications.
Chinese: 本论文提出了一种实时减实系统,通过语义分割和视频修复技术选择性地移除混合现实会议中的敏感物体,在720p分辨率下帧率超过20 fps,实现了隐私保护功能。
English: This thesis introduces a real-time diminished reality system that uses semantic segmentation and video inpainting to selectively remove sensitive objects from mixed reality meetings, achieving over 20 fps at 720p for privacy protection.

Authors:Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou, Jiwen Lu
Title: GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation
Abstract:
In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationship mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph, with waypoint nodes, object nodes and edges, which are used as queries to retrieve the library to build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot's navigation path and final goal. To handle cases of no solution or multiple solutions, we construct a navigation tree and a backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show that our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
Summary: This paper introduces a training-free framework for vision-and-language navigation that formulates navigation as graph constraint optimization, achieving superior zero-shot performance in both simulated and real-world environments.
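
To make the constraint-driven idea concrete, here is a minimal sketch of how an instruction fragment could be decomposed into spatial predicates and solved by exhaustive search over a discretized map; the predicates, thresholds, and object positions are illustrative assumptions, not the paper's constraint library or solver.

    # Hypothetical toy solver: "go to a spot left of the table, then stop
    # between that spot and the door" as two graph constraints on a 2D grid.
    import itertools
    import numpy as np

    objects = {"table": (4.0, 2.0), "door": (8.0, 6.0)}  # assumed detections

    def left_of(p, q, margin=0.5):
        return p[0] < q[0] - margin

    def between(p, q, r, tol=1.0):
        mid = (np.asarray(q) + np.asarray(r)) / 2
        return np.linalg.norm(np.asarray(p) - mid) < tol

    grid = [(0.5 * x, 0.5 * y) for x, y in itertools.product(range(20), repeat=2)]
    # Waypoint w must be left of the table; goal g must lie between w and the door.
    solutions = [(w, g) for w, g in itertools.product(grid, repeat=2)
                 if left_of(w, objects["table"]) and between(g, w, objects["door"])]
    print(len(solutions), "feasible (waypoint, goal) pairs")

In the paper's framework, an empty solution set would trigger the navigation tree's backtracking mechanism rather than a simple failure.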

Authors:Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
Title: SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets
Abstract:
Alzheimer's disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer's prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer's prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at https://github.com/emilykaczmarek/SSL-AD.
Summary: This study adapts temporal self-supervised learning approaches for 3D brain MRI analysis to overcome limitations in Alzheimer's prediction, demonstrating superior performance over supervised methods across multiple tasks while handling variable inputs and time intervals.
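
As a rough illustration of the temporal order prediction objective mentioned above, the sketch below classifies which permutation was applied to a shuffled sequence of per-scan embeddings; the encoder output dimension, head design, and sequence length are assumptions for illustration only.

    import itertools
    import torch
    import torch.nn as nn

    PERMS = list(itertools.permutations(range(3)))      # 6 orderings of 3 scans

    class OrderHead(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3 * dim, 256), nn.ReLU(),
                                     nn.Linear(256, len(PERMS)))

        def forward(self, feats):                       # feats: (B, 3, dim)
            return self.mlp(feats.flatten(1))           # logits over permutations

    feats = torch.randn(8, 3, 128)                      # embeddings of shuffled scans
    target = torch.randint(0, len(PERMS), (8,))         # index of the applied shuffle
    loss = nn.functional.cross_entropy(OrderHead()(feats), target)

A fixed length of three scans keeps the sketch simple; the paper's extensions specifically target variable-length inputs and varying time intervals.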

Authors:Iacopo Curti, Pierluigi Zama Ramirez, Alioscia Petrelli, Luigi Di Stefano
Title: Multimodal SAM-adapter for Semantic Segmentation
Abstract:
Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM's rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter.
Summary: The MM SAM-adapter framework enhances multimodal semantic segmentation by integrating auxiliary sensor data with RGB features through an adapter network, achieving state-of-the-art robustness across diverse challenging conditions.
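
The core mechanism, injecting fused auxiliary features into SAM's RGB features through a small adapter, might look roughly like the following; the dimensions, the bottleneck design, and the zero-initialized gate are assumptions for this sketch, not the released architecture.

    import torch
    import torch.nn as nn

    class MMAdapter(nn.Module):
        def __init__(self, rgb_dim=256, aux_dim=256, hidden=64):
            super().__init__()
            self.fuse = nn.Sequential(                 # bottleneck keeps it lightweight
                nn.Linear(aux_dim, hidden), nn.GELU(), nn.Linear(hidden, rgb_dim))
            self.gate = nn.Parameter(torch.zeros(1))   # starts as a pure RGB path

        def forward(self, rgb_feats, aux_feats):
            # rgb_feats: (B, N, C) SAM tokens; aux_feats: (B, N, C) fused LiDAR/IR tokens
            return rgb_feats + torch.tanh(self.gate) * self.fuse(aux_feats)

    out = MMAdapter()(torch.randn(2, 1024, 256), torch.randn(2, 1024, 256))

Because the gate starts at zero, such a model initially behaves like plain SAM on RGB and only learns to pull in auxiliary cues where they add information, matching the selective-fusion motivation in the abstract.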

Authors:Fabien Allemand, Attilio Fiandrotti, Sumanta Chaudhuri, Alaa Eddine Mazouz
Title: Efficient Learned Image Compression Through Knowledge Distillation
Abstract:
Learned image compression sits at the intersection of machine learning and image processing. With advances in deep learning, neural network-based compression methods have emerged. In this process, an encoder maps the image to a low-dimensional latent space, which is then quantized, entropy-coded into a binary bitstream, and transmitted to the receiver. At the receiver end, the bitstream is entropy-decoded, and a decoder reconstructs an approximation of the original image. Recent research suggests that these models consistently outperform conventional codecs. However, they require significant processing power, making them unsuitable for real-time use on resource-constrained platforms, which hinders their deployment in mainstream applications. This study aims to reduce the resource requirements of neural networks used for image compression by leveraging knowledge distillation, a training paradigm where smaller neural networks, partially trained on the outputs of larger, more complex models, can achieve better performance than when trained independently. Our work demonstrates that knowledge distillation can be effectively applied to image compression tasks: i) across various architecture sizes, ii) to achieve different image quality/bit rate tradeoffs, and iii) to save processing and energy resources. This approach introduces new settings and hyperparameters, and future research could explore the impact of different teacher models, as well as alternative loss functions. Knowledge distillation could also be extended to transformer-based models. The code is publicly available at: https://github.com/FABallemand/PRIM .
Summary: This study employs knowledge distillation to enhance neural network-based image compression by reducing computational demands while maintaining performance across various architectures and quality-bitrate tradeoffs.
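
A minimal sketch of such a training objective, assuming a standard rate-distortion loss plus a distillation term on the teacher's reconstruction; the loss weights and the exact distillation targets are assumptions, and the paper notes that other settings and loss functions remain to be explored.

    import torch
    import torch.nn.functional as F

    def distillation_loss(x, student, teacher, lam_rate=0.01, lam_kd=0.5):
        # student/teacher: dicts with "x_hat" (reconstruction) and "bits"
        # (estimated coding cost); the teacher is frozen during distillation.
        distortion = F.mse_loss(student["x_hat"], x)
        rate = student["bits"].mean()
        kd = F.mse_loss(student["x_hat"], teacher["x_hat"].detach())
        return distortion + lam_rate * rate + lam_kd * kd

    x = torch.rand(4, 3, 64, 64)
    student = {"x_hat": torch.rand_like(x, requires_grad=True), "bits": torch.rand(4)}
    teacher = {"x_hat": torch.rand_like(x), "bits": torch.rand(4)}
    loss = distillation_loss(x, student, teacher)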

Authors:Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
Title: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Abstract:
Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet they still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, an approach that leverages the temporal similarity of diffusion models while ignoring similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates its information to all the other tokens, which reduces the number of computed tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
Summary: This paper introduces Cluster-Driven Feature Caching (ClusCa), a method that accelerates diffusion transformers by performing spatial clustering to reduce token computations by over 90%, achieving significant speedups in text-to-image and text-to-video generation without requiring retraining.
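
The token-reduction step can be pictured as follows: cluster tokens spatially, run the expensive block on one representative per cluster, and broadcast the residual update to all members. The interface below is an illustrative assumption, not the released implementation.

    import torch

    def cluster_cached_step(block, tokens, assign, k):
        # tokens: (N, C); assign: (N,) cluster ids in [0, k), e.g. from k-means.
        # Assumes every cluster is non-empty.
        reps = torch.stack([tokens[assign == c].mean(0) for c in range(k)])
        delta = block(reps) - reps        # heavy computation on k tokens only
        return tokens + delta[assign]     # propagate each cluster's update

    tokens = torch.randn(4096, 1024)
    assign = torch.randint(0, 16, (4096,))            # 16 clusters, as in the title
    out = cluster_cached_step(torch.nn.Linear(1024, 1024), tokens, assign, k=16)

With 16 representatives over 4,096 tokens, the vast majority of the per-step token computation is skipped, consistent with the paper's reported reduction of over 90%.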

Authors:Evan Murphy, Marco Viola, Vladimir A. Krylov
Title: A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments
Abstract:
In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.
Summary: This paper introduces a probabilistic framework using energy maps and a stochastic birth-and-death algorithm to accurately geolocate street furniture in urban settings, enhancing localization through integration with external geospatial data and demonstrating effectiveness via simulations with Dublin's street lighting infrastructure.
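
A toy version of a stochastic birth-and-death loop over an energy map might look like this; the acceptance rules, cooling schedule, and random energy map are schematic assumptions rather than the paper's algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    energy = rng.random((100, 100))   # low energy = plausible asset location
    points, T = set(), 1.0

    for _ in range(200):
        y, x = rng.integers(100), rng.integers(100)    # birth proposal
        if rng.random() < np.exp(-energy[y, x] / T):   # accept births at low energy
            points.add((int(y), int(x)))
        for p in list(points):                         # death sweep
            if rng.random() < 1 - np.exp(-energy[p] / T):
                points.discard(p)
        T *= 0.98                                      # anneal toward greedy moves
    print(len(points), "retained asset locations")

In the paper's setting, the energy map itself is built from detections and external geospatial priors such as GIS layers and road maps, so the sampler concentrates points on contextually plausible positions.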

Authors:Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Yanbing Zeng, Xiaoming Wei
Title: MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation
Abstract:
Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: https://wj-inf.github.io/MagicMirror-page/.

Authors:Rini Smita Thakur, Rajeev Ranjan Dwivedi, Vinod K Kurmi
Title: Grad-CL: Source Free Domain Adaptation with Gradient Guided Feature Disalignment
Abstract:
Accurate segmentation of the optic disc and cup is critical for the early diagnosis and management of ocular diseases such as glaucoma. However, segmentation models trained on one dataset often suffer significant performance degradation when applied to target data acquired under different imaging protocols or conditions. To address this challenge, we propose Grad-CL, a novel source-free domain adaptation framework that leverages a pre-trained source model and unlabeled target data to robustly adapt segmentation performance without requiring access to the original source data. Grad-CL combines a gradient-guided pseudolabel refinement module with a cosine similarity-based contrastive learning strategy. In the first stage, salient class-specific features are extracted via a gradient-based mechanism, enabling more accurate uncertainty quantification and robust prototype estimation for refining noisy pseudolabels. In the second stage, a contrastive loss based on cosine similarity is employed to explicitly enforce inter-class separability between the gradient-informed features of the optic cup and disc. Extensive experiments on challenging cross-domain fundus imaging datasets demonstrate that Grad-CL outperforms state-of-the-art unsupervised and source-free domain adaptation methods, achieving superior segmentation accuracy and improved boundary delineation. Project and code are available at https://visdomlab.github.io/GCL/.
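
The second-stage contrastive term could be sketched as below, pulling same-structure features together and pushing cup features away from disc features via cosine similarity; the margin and the exact formulation are assumptions, not the paper's loss.

    import torch
    import torch.nn.functional as F

    def cup_disc_contrast(cup, disc, margin=0.2):
        # cup, disc: (N, C) gradient-informed features of the two structures
        cup, disc = F.normalize(cup, dim=1), F.normalize(disc, dim=1)
        compact = (1 - cup @ cup.T).mean() + (1 - disc @ disc.T).mean()
        separate = F.relu(cup @ disc.T - margin).mean()   # penalize cup-disc overlap
        return compact + separate

    loss = cup_disc_contrast(torch.randn(32, 128), torch.randn(32, 128))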

Authors:Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee
Title: BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals
Abstract:
In autonomous driving, trajectory prediction is essential for ensuring safe and efficient navigation. To improve prediction accuracy, recent approaches often rely on pre-built high-definition (HD) maps or real-time local map construction modules to incorporate static environmental information. However, pre-built HD maps are limited to specific regions and cannot adapt to transient changes. In addition, local map construction modules, which recognize only predefined elements, may fail to capture critical scene details or introduce errors that degrade prediction performance. To overcome these limitations, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a novel trajectory prediction framework that operates directly in the bird's-eye view (BEV) space utilizing real-time sensor data without relying on any pre-built maps. BEVTraj leverages deformable attention to efficiently extract relevant context from dense BEV features. Furthermore, we introduce a Sparse Goal Candidate Proposal (SGCP) module, which enables fully end-to-end prediction without requiring any post-processing steps. Extensive experiments demonstrate that BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating the dependency on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
Summary: BEVTraj is a novel autonomous driving trajectory prediction framework that uses real-time sensor data in bird's-eye view space, eliminating dependency on pre-built HD maps while achieving comparable performance through deformable attention and sparse goal candidate proposal modules.

Authors:Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang
Title: Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Abstract:
Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math.
Summary: The AVI-Math benchmark is introduced to evaluate multimodal mathematical reasoning in UAV imagery, revealing that current vision-language models struggle with these complex tasks despite their broader successes.

Authors:Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Jianshu Li
Title: LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Abstract:
As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models of 2x larger scale by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at: https://github.com/HJNVR/LaV-CoT
Summary: The LaV-CoT framework introduces a language-aware visual chain-of-thought approach with multi-aspect reward optimization, achieving significant accuracy improvements over existing models in multilingual visual question answering tasks.

Authors:Xiaodong Guo, Tong Liu, Yike Li, Zi'ang Lin, Zhihong Deng
Title: TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion
Abstract:
RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model's real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder's capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.
Summary: TUNI introduces a unified RGB-T encoder that integrates feature extraction and fusion for efficient semantic segmentation, achieving competitive performance with reduced complexity and real-time inference.

Authors:Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng
Title: ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking
Abstract:
RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
Summary: ISTASTrack is a novel transformer-based hybrid tracker that integrates ANN and SNN branches with ISTA adapters for effective RGB-Event fusion, achieving state-of-the-art performance across multiple benchmarks while maintaining energy efficiency.
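
To see what "unfolding ISTA" means here, the sketch below runs a few learned shrinkage-thresholding steps, z <- soft(z - W^T(Wz - x)), to produce a sparse code of one branch's features; the dimensions, step count, and unit step size are illustrative assumptions, not the paper's adapter.

    import torch
    import torch.nn as nn

    def soft_threshold(z, theta):
        return torch.sign(z) * torch.clamp(z.abs() - theta, min=0)

    class ISTAAdapter(nn.Module):
        def __init__(self, dim=256, steps=3):
            super().__init__()
            self.W = nn.Linear(dim, dim, bias=False)            # learned dictionary
            self.theta = nn.Parameter(torch.full((steps,), 0.01))
            self.steps = steps

        def forward(self, x):
            # x: (B, N, C) features from one branch; returns a sparse code z
            z = torch.zeros_like(x)
            for t in range(self.steps):
                resid = z @ self.W.weight.T - x                 # Wz - x
                z = soft_threshold(z - resid @ self.W.weight, self.theta[t])
            return z

    z = ISTAAdapter()(torch.randn(2, 196, 256))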

Authors:Anne Marthe Sophie Ngo Bibinbe, Chiron Bang, Patrick Gagnon, Jamie Ahloy-Dallaire, Eric R. Paquet
Title: An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock
Abstract:
The need for long-term multi-object tracking (MOT) is growing due to the demand for analyzing individual behaviors in videos that span several minutes. Unfortunately, due to identity switches between objects, the tracking performance of existing MOT approaches decreases over time, making them difficult to apply for long-term tracking. However, in many real-world applications, such as in the livestock sector, it is possible to obtain sporadic identifications for some of the animals from sources like feeders. To address the challenges of long-term MOT, we propose a new framework that combines both uncertain identities and tracking using a Hidden Markov Model (HMM) formulation. In addition to providing real-world identities to animals, our HMM framework improves the F1 score of ByteTrack, a leading MOT approach, even when re-identification is used, on a 10-minute pig tracking dataset with 21 identifications at the pen's feeding station. We also show that our approach is robust to the uncertainty of identifications, with performance increasing as identities are provided more frequently. The improved performance of our HMM framework was also validated on the MOT17 and MOT20 benchmark datasets using both ByteTrack and FairMOT. The code for this new HMM framework and the new 10-minute pig tracking video dataset are available at: https://github.com/ngobibibnbe/uncertain-identity-aware-tracking
Summary: The proposed Hidden Markov Model framework effectively addresses long-term multi-object tracking challenges by integrating sporadic real-world identifications, significantly improving tracking accuracy on both livestock and benchmark datasets.
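
The decoding step of such an HMM can be illustrated with a small Viterbi routine, where states are animal identities and observations are occasional feeder identifications; the matrices below are toy assumptions, not the paper's learned parameters.

    import numpy as np

    def viterbi(log_trans, log_emit):
        # log_trans: (S, S) log P(id_t | id_{t-1}); log_emit: (T, S) per-frame evidence
        T, S = log_emit.shape
        dp, ptr = log_emit[0].copy(), np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = dp[:, None] + log_trans       # best predecessor for each state
            ptr[t] = scores.argmax(0)
            dp = scores.max(0) + log_emit[t]
        path = [int(dp.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(ptr[t, path[-1]]))
        return path[::-1]

    S = 3
    log_trans = np.log(np.full((S, S), 0.05) + 0.85 * np.eye(S))  # switches are rare
    log_emit = np.full((5, S), np.log(1.0 / S))   # most frames carry no identity...
    log_emit[2] = np.log([0.9, 0.05, 0.05])       # ...except one feeder read at t=2
    print(viterbi(log_trans, log_emit))           # the sparse read labels all frames

The point of the example is that a single confident identification, combined with a sticky transition model, is enough to propagate an identity across the whole sequence.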

Authors:Zhi Ying, Boxiang Rong, Jingyu Wang, Maoyuan Xu
Title: Chord: Chain of Rendering Decomposition for PBR Material Estimation from Generated Texture Images
Abstract:
Material creation and reconstruction are crucial for appearance modeling but traditionally require significant time and expertise from artists. While recent methods leverage visual foundation models to synthesize PBR materials from user-provided inputs, they often fall short in quality, flexibility, and user control. We propose a novel two-stage generate-and-estimate framework for PBR material generation. In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with user input. In the estimation stage, we introduce a chained decomposition scheme that sequentially predicts SVBRDF channels by passing previously extracted representation as input into a single-step image-conditional diffusion model. Our method is efficient, high quality, and enables flexible user control. We evaluate our approach against existing material generation and estimation methods, demonstrating superior performance. Our material estimation method shows strong robustness on both generated textures and in-the-wild photographs. Furthermore, we highlight the flexibility of our framework across diverse applications, including text-to-material, image-to-material, structure-guided generation, and material editing.

Authors:Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
Title: DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
Abstract:
Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust depth loss, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion
Summary: The proposed DGFusion network introduces a depth-guided multimodal fusion method that dynamically adapts sensor fusion using depth-aware features and local depth tokens, achieving state-of-the-art panoptic and semantic segmentation performance on challenging datasets.

Authors:Moslem Yazdanpanah, Ali Bahri, Mehrdad Noori, Sahar Dastani, Gustavo Adolfo Vargas Hakim, David Osowiechi, Ismail Ben Ayed, Christian Desrosiers
Title: Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Clouds Classification via Token Purging
Abstract:
Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4 times faster and 5.5 times more memory efficient than our baseline, making it suitable for real-world deployment. Code is available at https://github.com/MosyMosy/Purge-Gate
Summary: This paper introduces Token Purging (PG), a backpropagation-free test-time adaptation method for 3D point cloud classification that removes domain-shifted tokens before attention layers, achieving superior accuracy and efficiency over existing approaches.
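
A schematic of token purging, keeping only the patch tokens most consistent with the CLS token before attention; the similarity criterion and keep ratio are assumptions for illustration, and the paper's PG-SP variant additionally uses source statistics.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()                  # backpropagation-free by construction
    def purge_tokens(tokens, keep_ratio=0.7):
        # tokens: (B, N+1, C), CLS token at index 0
        cls, patches = tokens[:, :1], tokens[:, 1:]
        sim = F.cosine_similarity(patches, cls, dim=-1)          # (B, N)
        k = int(patches.shape[1] * keep_ratio)
        idx = sim.topk(k, dim=1).indices                         # CLS-consistent tokens
        kept = torch.gather(
            patches, 1, idx[..., None].expand(-1, -1, patches.shape[-1]))
        return torch.cat([cls, kept], dim=1)

    out = purge_tokens(torch.randn(2, 1025, 384))    # -> (2, 717, 384)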

Authors:Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
Title: SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Abstract:
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
Summary: Significant advances in spatial intelligence are hindered by limited training data, motivating the creation of SpatialVID, a large-scale dataset with diverse videos and rich 3D annotations to enhance model generalization and performance.

Authors:Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
Title: Measuring Epistemic Humility in Multimodal Large Language Models
Abstract:
Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
Summary: HumbleBench is a new benchmark designed to assess multimodal large language models' ability to reject plausible but incorrect answers by incorporating a "None of the above" option, addressing a critical gap in evaluating epistemic humility and reliability in safety-critical applications.

Authors:Sijun Dong, Yuxuan Hu, LiBo Wang, Geng Chen, Xiaoliang Meng
Title: PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
Abstract:
To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
Summary: PeftCD is a parameter-efficient change detection framework leveraging Vision Foundation Models to address pseudo changes and cross-domain generalization, achieving state-of-the-art accuracy across multiple datasets with minimal trainable parameters.
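
For readers unfamiliar with the PEFT side, a minimal LoRA wrapper looks roughly like this; the rank, scaling, and initialization are common defaults assumed here, not PeftCD's exact configuration.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False            # freeze the foundation weights
            self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
            self.B = nn.Parameter(torch.zeros(rank, base.out_features))
            self.scale = alpha / rank              # B starts at zero: no-op at init

        def forward(self, x):
            return self.base(x) + (x @ self.A @ self.B) * self.scale

    layer = LoRALinear(nn.Linear(768, 768))
    n_train = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(n_train)   # 12,288 trainable parameters vs. ~590k in the frozen base

Only the low-rank matrices A and B are trained, which is what lets a large frozen backbone like SAM2 or DINOv3 be adapted with a minimal set of additional parameters.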

Authors:Akshit Achara, Esther Puyol Anton, Alexander Hammers, Andrew P. King
Title: Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification
Abstract:
Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR
Summary: This study demonstrates that deep learning models for Alzheimer's disease diagnosis from MRI scans exhibit shortcut learning and demographic bias related to race and sex, potentially compromising diagnostic fairness across different population groups.

Authors:Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, Liang-Yan Gui
Title: InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Abstract:
While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.
Summary: The InterAct benchmark addresses limitations in existing 3D human-object interaction datasets by consolidating and optimizing 21.81 hours of data into 30.70 hours of high-quality interactions with detailed annotations, while establishing unified generative modeling tasks that achieve state-of-the-art performance.

Authors:Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye
Title: Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Abstract:
Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen, which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen both for unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/

Authors:Jian Zhu, Xin Zou, Xi Wang, Ning Zhang, Bian Wu, Yao Yang, Ying Zhou, Lingfang Zeng, Chang Tang, Cheng Luo
Title: Generative Diffusion Contrastive Network for Multi-View Clustering
Abstract:
In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, multi-view fusion suffers from a low-quality data problem, which primarily arises for two reasons: 1) certain views are contaminated by noisy data, and 2) some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view features of each sample, making it robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves state-of-the-art results in deep MVC tasks. The source code is publicly available at https://github.com/HackerHyper/GDCN.
Summary: This paper introduces a novel Stochastic Generative Diffusion Fusion (SGDF) method and the Generative Diffusion Contrastive Network (GDCN) to address low-quality data issues in Multi-View Clustering, achieving state-of-the-art performance through robust multi-view fusion.

Authors:Ha Linh Nguyen, Tze Ho Elden Tse, Angela Yao
Title: Improving Human Motion Plausibility with Body Momentum
Abstract:
Many studies decompose human motion into local motion in a frame attached to the root joint and global motion of the root joint in the world frame, treating them separately. However, these two components are not independent. Global movement arises from interactions with the environment, which are, in turn, driven by changes in the body configuration. Motion models often fail to precisely capture this physical coupling between local and global dynamics, while deriving global trajectories from joint torques and external forces is computationally expensive and complex. To address these challenges, we propose using whole-body linear and angular momentum as a constraint to link local motion with global movement. Since momentum reflects the aggregate effect of joint-level dynamics on the body's movement through space, it provides a physically grounded way to relate local joint behavior to global displacement. Building on this insight, we introduce a new loss term that enforces consistency between the generated momentum profiles and those observed in ground-truth data. Incorporating our loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion. Code and data are available at the project page https://hlinhn.github.io/momentum_bmvc.
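
A rough sketch of such a momentum-consistency term, estimating whole-body linear and angular momentum from joint trajectories by finite differences; the per-joint mass proxies, time step, and L1 matching are assumptions, not the paper's exact loss.

    import torch

    def momentum_loss(pred, gt, masses, dt=1.0 / 30):
        # pred, gt: (T, J, 3) joint positions; masses: (J,) per-joint mass proxies
        def momenta(x):
            v = (x[1:] - x[:-1]) / dt                        # (T-1, J, 3) velocities
            lin = (masses[None, :, None] * v).sum(1)         # linear momentum per frame
            com = (masses[None, :, None] * x).sum(1) / masses.sum()
            r = x[:-1] - com[:-1, None]                      # lever arms about the CoM
            ang = torch.cross(r, masses[None, :, None] * v, dim=-1).sum(1)
            return lin, ang
        (lp, ap), (lg, ag) = momenta(pred), momenta(gt)
        return (lp - lg).abs().mean() + (ap - ag).abs().mean()

    loss = momentum_loss(torch.randn(60, 24, 3), torch.randn(60, 24, 3), torch.ones(24))

Because momentum aggregates joint-level dynamics into a single global quantity, a term like this couples local joint motion to global displacement without computing joint torques or external forces.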

Authors:Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang
Title: Semantic Concentration for Self-Supervised Dense Representations Learning
Abstract:
Image-level self-supervised learning (SSL) has made significant progress in recent years, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon in which patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available at https://github.com/KID-7391/CoTAP.
Summary: This study addresses over-dispersion in dense self-supervised learning by proposing explicit semantic concentration through patch correspondence distillation with a noise-tolerant ranking loss and object-aware filtering, enhancing representation learning across various tasks.

Authors:Yuchan Jie, Yushen Xu, Xiaosong Li, Fuqiang Zhou, Jianming Lv, Huafeng Li
Title: FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
Abstract:
As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public datasets and our AVMS dataset demonstrate that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
Summary: FS-Diff is a novel joint image fusion and super-resolution method that leverages semantic guidance and clarity-aware mechanisms to generate high-resolution fused images with enhanced details and semantic information, outperforming existing techniques across multiple datasets.

Authors:Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra
Title: Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
Abstract:
Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.
Summary: The DRiFt framework enhances medical vision-language models by decoupling clinical features from task-agnostic noise, improving both in-distribution performance and robustness across datasets for safer clinical deployment.

Authors:Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards, Justin Collins, John Kelly, Ashwin Sridhar, Maxine Tran, Faiz Mumtaz, Nevil Pavithran, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos
Title: Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment
Abstract:
Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.
Summary: This study explores how self-supervised pre-training impacts few-shot surgical skill assessment, demonstrating that domain-relevant datasets outperform larger but less aligned sources and that incorporating procedure-specific data enhances performance.

Authors:Hui Li, Yi You, Qiqi Chen, Bingfeng Zhang, George Q. Huang
Title: Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
Abstract:
Generative AI is changing how complex workflows are executed in industry, and large multimodal models (LMMs) now empower fashion design in the garment industry. Current generative AI models can readily turn brainstorming into polished designs, but fine-grained customization still suffers from textual ambiguity when end-users lack professional background knowledge. Thus, we propose the Better Understanding Generation (BUG) workflow with an LMM to automatically create and customize fine-grained clothing designs from chat via image-into-prompt. Our framework unleashes users' creative potential beyond words and also lowers the barriers to clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated on generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.
Summary: Generative AI enhances complex workflows in the garment industry by enabling easy transformation of ideas into designs, yet struggles with fine-grained customization due to textual ambiguity, motivating the proposed BUG workflow for automated and precise clothing design from conversational inputs.

Authors:Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas
Title: Image Recognition with Vision and Language Embeddings of VLMs
Abstract:
Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
Chinese: 视觉语言模型在零样本图像分类中展现出互补优势,语言引导和纯视觉方法在不同类别上各有所长,据此提出了一种无需学习的融合策略,有效提升了分类性能。
English: Vision-language models demonstrate complementary strengths in zero-shot image classification, with language-guided and vision-only approaches each excelling in different categories, leading to a proposed fusion method that enhances performance without additional training.
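
As an illustration of the learning-free, per-class-precision fusion described in the abstract above, the sketch below combines a text-prompt classifier and a vision k-NN classifier by keeping, per sample, the prediction whose classifier has the higher precision on the class it predicted. The precision-estimation split, the tie-breaking rule, and all data here are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def estimate_per_class_precision(preds, labels, num_classes):
    """Precision of each predicted class, measured on a labelled reference split."""
    precision = np.zeros(num_classes)
    for c in range(num_classes):
        picked = preds == c
        precision[c] = (labels[picked] == c).mean() if picked.any() else 0.0
    return precision

def fuse_predictions(text_preds, vision_preds, text_prec, vision_prec):
    """Keep, per sample, the prediction whose classifier is more precise on the
    class it predicted (ties go to the text classifier)."""
    return np.where(
        text_prec[text_preds] >= vision_prec[vision_preds],
        text_preds,
        vision_preds,
    )

# Toy usage with random data standing in for CLIP-style outputs.
rng = np.random.default_rng(0)
num_classes, n_ref, n_test = 10, 200, 50
ref_labels = rng.integers(0, num_classes, n_ref)
text_ref, vision_ref = rng.integers(0, num_classes, (2, n_ref))
text_prec = estimate_per_class_precision(text_ref, ref_labels, num_classes)
vision_prec = estimate_per_class_precision(vision_ref, ref_labels, num_classes)
text_test, vision_test = rng.integers(0, num_classes, (2, n_test))
print(fuse_predictions(text_test, vision_test, text_prec, vision_prec)[:10])
```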

Authors:Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Title: Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Abstract:
Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.
中文摘要:MatCha作为首个材料表征图像理解的基准,揭示了当前多模态大语言模型在需要高级领域知识和视觉分析的复杂任务中,其表现远逊于人类专家。
English Summary: MatCha is introduced as the first benchmark for materials characterization image understanding, revealing that current multimodal large language models significantly underperform human experts in tasks requiring advanced domain knowledge and visual analysis.

Authors:Anthony P. Addison, Felix Wagner, Wentian Xu, Natalie Voets, Konstantinos Kamnitsas
Title: Modality-Agnostic Input Channels Enable Segmentation of Brain Lesions in Multimodal MRI with Sequences Unavailable During Training
Abstract:
Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG
中文: 本研究提出了一种改进的U-net架构,通过引入模态无关通道和图像增强策略生成人工MRI对比度,能够在保持解剖真实性的同时,有效分割训练中见过和未见过的脑部病变成像模态。
English: This study introduces a modified U-net architecture with a modality-agnostic pathway and an image augmentation strategy to create artificial MRI contrasts, enabling effective segmentation of brain lesions across both seen and unseen imaging modalities while maintaining anatomical realism.

Authors:Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
Title: Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Abstract:
Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
中文摘要:大型视觉语言模型在通用医疗任务中表现优异,但在牙科全景X光片解读方面存在局限,为此开发了MMOral数据集和OralGPT模型,通过微调显著提升了诊断性能。
English Summary: Large vision-language models show promise for medical tasks but struggle with dental panoramic X-rays, leading to the creation of MMOral dataset and OralGPT model, which significantly improves diagnostic accuracy through fine-tuning.

Authors:Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma
Title: Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
Abstract:
In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights are publicly available at https://github.com/jiesihu/Medverse.
中文: Medverse提出了一种通用的3D医学影像上下文学习模型,通过渐进式细化预测和多尺度解剖感知,在多样化任务和解剖区域中实现高保真预测和全局解剖理解,显著优于现有基准模型。
English: Medverse introduces a universal in-context learning model for 3D medical imaging that achieves high-fidelity predictions and global anatomical awareness across diverse tasks and anatomical regions, significantly outperforming existing baselines.

Authors:Yuiko Uchida, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation
Abstract:
This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.
中文摘要:本文提出OSIM这一面向3D场景的物体中心化评估新指标,通过物体检测模型量化场景中各物体的“物体性”,用户研究表明其比现有指标更符合人类感知,并重新评估了当前主流3D重建与生成模型。
English Summary: This paper introduces OSIM, an object-centric evaluation metric for 3D scenes that aligns more closely with human perception by quantifying objectness through object detection models, as validated by user studies and comparative analyses.
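
The sketch below illustrates one way an object-centric score of this kind can be computed: crop each detected object from a reference rendering and from the rendering under evaluation, extract features for both crops, and average their cosine similarities. The tiny random backbone, the fixed crop size, and the externally supplied boxes are placeholders for illustration; the actual OSIM relies on an object detection model and its feature representations.

```python
import torch
import torch.nn.functional as F

def crop(img, box):
    """img: (C, H, W); box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return img[:, y1:y2, x1:x2]

@torch.no_grad()
def objectness_similarity(ref_img, test_img, boxes, feat_extractor, size=64):
    """Average cosine similarity between per-object features of a reference
    rendering and a reconstructed/generated rendering of the same scene."""
    sims = []
    for box in boxes:
        pair = torch.stack([
            F.interpolate(crop(ref_img, box)[None], size=(size, size),
                          mode="bilinear", align_corners=False)[0],
            F.interpolate(crop(test_img, box)[None], size=(size, size),
                          mode="bilinear", align_corners=False)[0],
        ])
        feats = feat_extractor(pair).flatten(1)          # (2, D)
        sims.append(F.cosine_similarity(feats[0:1], feats[1:2]).item())
    return sum(sims) / len(sims)

# Toy usage: a random conv net stands in for a detector backbone, and the boxes
# would normally come from running the detector on the reference image.
backbone = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, 2, 1), torch.nn.ReLU(),
                               torch.nn.AdaptiveAvgPool2d(1))
ref, test = torch.rand(3, 128, 128), torch.rand(3, 128, 128)
print(objectness_similarity(ref, test, [(10, 10, 70, 70), (40, 60, 120, 120)], backbone))
```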

Authors:Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Title: Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Abstract:
Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
Chinese: ZeroPlantSeg是一种零样本分割方法,通过结合基础分割模型和视觉语言模型,无需训练即可从顶视图中有效分割出完整的莲座状植物个体,其性能优于现有方法。
English: ZeroPlantSeg is a zero-shot method that combines a foundation segmentation model with a vision-language model to effectively segment entire rosette-shaped plant individuals from top-view images, outperforming existing approaches without requiring training data.

Authors:Jianqin Gao, Tianqi Wang, Yu Zhang, Yishu Zhang, Chenyuan Wang, Allan Dong, Zihao Wang
Title: FPI-Det: A Face-Phone Interaction Dataset for Phone-Use Detection and Understanding
Abstract:
The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human-device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and the dataset are available at https://github.com/KvCgRv/FPI-Det.
中文摘要:FPI-Det数据集通过在多场景下提供人脸与手机的同步标注,解决了现有基准在细粒度人机交互检测方面的不足,为手机使用行为识别建立了新标准。
English Summary: The FPI-Det dataset addresses limitations in current benchmarks by providing detailed annotations for detecting phone usage through human-device interactions across various challenging scenarios.

Authors:Jifeng Shen, Haibo Zhan, Xin Zuo, Heng Fan, Xiaohui Yuan, Jun Li, Wankou Yang
Title: IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
Abstract:
Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on a cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on FLIR, LLVIP and M³FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
中文摘要:本文提出IRDFusion框架,通过跨模态特征对比与筛选策略实现自适应特征融合,在抑制背景干扰的同时增强显著目标特征,在多个数据集上达到最优性能。
English Summary: This paper introduces IRDFusion, a novel feature fusion framework that enhances salient object structures through iterative cross-modal feature refinement while suppressing background noise, achieving state-of-the-art performance on multiple datasets.
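
A minimal sketch of the differential-feedback idea behind IRDFusion is shown below: two modality feature maps are refined, their difference is projected into a guidance signal, and that signal is fed back before the next iteration. The module names, channel sizes, and update rule are illustrative assumptions rather than the paper's MFRM/DFFM design.

```python
import torch
import torch.nn as nn

class IterativeDifferenceFusion(nn.Module):
    """Toy version of iterative difference-guided fusion: each round, the
    inter-modal difference is projected and fed back to both branches so that
    complementary structure is amplified and shared background is suppressed."""
    def __init__(self, channels, steps=3):
        super().__init__()
        self.steps = steps
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)   # shared refinement
        self.feedback = nn.Conv2d(channels, channels, 1)            # difference projection

    def forward(self, rgb_feat, ir_feat):
        for _ in range(self.steps):
            rgb_feat = torch.relu(self.refine(rgb_feat))
            ir_feat = torch.relu(self.refine(ir_feat))
            diff = self.feedback(rgb_feat - ir_feat)   # differential guidance signal
            rgb_feat = rgb_feat + diff                 # amplify complementary cues
            ir_feat = ir_feat - diff
        return rgb_feat + ir_feat                      # fused representation

fusion = IterativeDifferenceFusion(channels=16)
print(fusion(torch.rand(1, 16, 32, 32), torch.rand(1, 16, 32, 32)).shape)
```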

Authors:Qiuhui Chen, Xuancheng Yao, Huping Ye, Yi Hong
Title: Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models
Abstract:
Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to the noise potentially introduced by LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.
中文: Med3DInsight是一种创新的预训练框架,通过平面切片感知变换器和部分最优传输对齐技术,将3D医学图像编码器与2D多模态大语言模型相结合,在无需人工标注的情况下实现了分割和分类任务的最先进性能。
English: Med3DInsight is a novel pretraining framework that integrates 3D medical image encoders with 2D multimodal large language models through a plane-slice-aware transformer and partial optimal transport alignment, achieving state-of-the-art performance in segmentation and classification tasks without human annotations.

Authors:Umair Hassan
Title: COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
Abstract:
Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
中文摘要:COCO-Urdu数据集通过为5.9万张图像提供31.9万条经过质量验证的乌尔都语标注,解决了乌尔都语多模态资源匮乏的问题,成为最大的公开乌尔都语标注数据集,旨在减少视觉语言研究中的语言偏见。
English Summary: The COCO-Urdu dataset addresses the scarcity of Urdu multimodal resources by providing 319,000 quality-validated Urdu captions for 59,000 images, establishing the largest public Urdu captioning dataset to reduce language bias in vision-language research.
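
The hybrid quality-estimation gate can be pictured as a weighted combination of the three signals named in the abstract, with low-scoring captions flagged for LLM-based refinement. The weights, the threshold, and the assumption that all scores are pre-normalised to [0, 1] are illustrative; the real pipeline relies on COMET-Kiwi, CLIP similarity, and BERTScore with back-translation.

```python
def hybrid_quality_score(comet_kiwi, clip_sim, bert_f1,
                         weights=(0.4, 0.3, 0.3), threshold=0.75):
    """Combine three quality signals for one translated caption: translation
    quality (COMET-Kiwi), visual grounding (CLIP image-text similarity), and
    semantic consistency via back-translation (BERTScore F1). Inputs are
    assumed pre-normalised to [0, 1]; weights and threshold are illustrative."""
    score = (weights[0] * comet_kiwi
             + weights[1] * clip_sim
             + weights[2] * bert_f1)
    needs_refinement = score < threshold   # low scorers go back for an LLM refinement pass
    return score, needs_refinement

# A caption with strong translation quality but weaker visual grounding.
print(hybrid_quality_score(comet_kiwi=0.82, clip_sim=0.55, bert_f1=0.90))
```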

Authors:Andrew Bell, Yan Kit Choi, Steffen E Petersen, Andrew King, Muhummad Sohaib Nazir, Alistair A Young
Title: Implicit Neural Representations of Intramyocardial Motion and Strain
Abstract:
Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is roughly 380× faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets. The code can be found at https://github.com/andrewjackbell/Displacement-INR.
中文: 本研究提出了一种基于隐式神经表示的方法,用于从标记MRI中精确量化左心室运动,在英国生物银行数据上实现了卓越的跟踪精度和效率。
English: This study introduces a method using implicit neural representations to accurately quantify left ventricular motion from tagging MRI, achieving superior tracking accuracy and efficiency on UK Biobank data.
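
A minimal version of a latent-conditioned implicit neural representation for displacement is sketched below: an MLP maps normalised space-time coordinates plus a per-subject latent code to an in-plane displacement vector, so motion can be queried continuously without inference-time optimisation. Layer widths, the 2D output, and the latent dimension are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DisplacementINR(nn.Module):
    """Minimal implicit neural representation of LV motion: a coordinate MLP
    conditioned on a per-subject latent code, mapping (x, y, t) to an
    in-plane displacement vector. Layer sizes are illustrative."""
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # (dx, dy) displacement
        )

    def forward(self, coords, latent):
        # coords: (N, 3) normalised (x, y, t); latent: (latent_dim,)
        z = latent.expand(coords.shape[0], -1)
        return self.net(torch.cat([coords, z], dim=-1))

model = DisplacementINR()
coords = torch.rand(1000, 3)                 # query points in space-time
latent = torch.randn(64)                     # learned code for one subject
print(model(coords, latent).shape)           # torch.Size([1000, 2])
```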

Authors:Magdalena Wysocki, Felix Duelmer, Ananya Bal, Nassir Navab, Mohammad Farid Azampour
Title: UltrON: Ultrasound Occupancy Networks
Abstract:
In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound's view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce UltrON, which leverages acoustic features to improve geometric consistency in a weakly-supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, UltrON generalizes to shapes of the same anatomy. We show that UltrON mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at https://github.com/magdalena-wysocki/ultron.
中文: 在自由手超声成像中,提出的UltrON方法利用声学特征和基于占据率的表示,通过解决遮挡和标注依赖性问题来增强三维形状重建,无需额外标注成本。
English: In free-hand ultrasound imaging, the proposed UltrON method uses acoustic features and an occupancy-based representation to enhance 3D shape reconstruction by addressing occlusion and annotation dependency issues without extra labeling costs.

Authors:Puskal Khadka, Rodrigue Rizk, Longwei Wang, KC Santosh
Title: CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
Abstract:
Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin
中文:CoSwin通过将局部卷积特征与分层注意力相结合来增强视觉Transformer,在小型图像分类基准上凭借有效的局部-全局特征融合实现了更优性能。
English: CoSwin enhances Vision Transformers by integrating localized convolutional features with hierarchical attention, achieving superior performance on small-scale image classification benchmarks through effective local-global feature fusion.
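
The local-global fusion idea can be sketched as an attention block paired with a parallel depthwise convolution over the token grid, so fine-grained spatial detail is re-injected alongside global context. This is an illustrative simplification with made-up dimensions, not the released CoSwin block (which builds on shifted-window attention).

```python
import torch
import torch.nn as nn

class ConvEnhancedAttentionBlock(nn.Module):
    """Illustrative local-global block: multi-head self-attention over tokens
    provides global context, while a parallel depthwise convolution on the
    spatial grid re-injects locality; the two branches are summed residually."""
    def __init__(self, dim=96, heads=3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise conv

    def forward(self, x, h, w):
        # x: (B, h*w, dim) token sequence over an h x w feature map
        n = self.norm(x)
        a, _ = self.attn(n, n, n)                                   # global branch
        grid = x.transpose(1, 2).reshape(x.shape[0], -1, h, w)
        l = self.local(grid).flatten(2).transpose(1, 2)             # local branch
        return x + a + l

block = ConvEnhancedAttentionBlock()
tokens = torch.rand(2, 8 * 8, 96)
print(block(tokens, 8, 8).shape)   # torch.Size([2, 64, 96])
```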

Authors:Lisa Dunlap, Joseph E. Gonzalez, Trevor Darrell, Fabian Caba Heilbron, Josef Sivic, Bryan Russell
Title: Discovering Divergent Representations between Text-to-Image Models
Abstract:
In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon
中文摘要:本文提出CompCon进化算法,用于发现不同文本到图像模型输出中视觉属性的差异及其触发提示,并通过对比PixArt和Stable Diffusion 3.5等流行模型验证了其有效性。
English Summary: This paper introduces CompCon, an evolutionary algorithm that identifies visual attributes more prevalent in one text-to-image model's outputs than another's and reveals the prompt concepts causing these differences, demonstrated through comparisons of popular models like PixArt and Stable Diffusion 3.5.

Authors:Rogerio Guimaraes, Frank Xiao, Pietro Perona, Markus Marks
Title: Diffusion-Based Action Recognition Generalizes to Untrained Domains
Abstract:
Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel-level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff. Code: https://github.com/frankyaoxiao/ActionDiff
中文: 本文提出了一种利用视觉扩散模型特征并通过变换器聚合的新方法,实现了跨物种、视角和场景的人类水平动作识别,在泛化基准测试中创下了最新最优性能。
English: This paper introduces a novel method using Vision Diffusion Model features aggregated by a transformer to achieve human-like action recognition across species, viewpoints, and contexts, setting new state-of-the-art results in generalization benchmarks.

Authors:Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: Recurrence Meets Transformers for Universal Multimodal Retrieval
Abstract:
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
中文: ReT-2是一种统一的多模态检索模型,采用带门控机制的循环Transformer动态整合跨模态信息,在多个基准测试中实现最优性能,同时提升效率并改善下游任务表现。
English: ReT-2 is a unified multimodal retrieval model that employs a recurrent Transformer with gating mechanisms to dynamically integrate cross-modal information, achieving state-of-the-art performance across benchmarks while enhancing efficiency and downstream task results.
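
The recurrent, LSTM-inspired gating over multi-layer representations can be illustrated with the toy module below, which walks over a stack of per-layer embeddings and lets a sigmoid gate decide how much of each layer enters the running state. The shapes and the single-gate update are simplifying assumptions, not the ReT-2 architecture.

```python
import torch
import torch.nn as nn

class GatedLayerRecurrence(nn.Module):
    """Toy recurrent, LSTM-style gating over a stack of per-layer embeddings
    (e.g., one pooled vector per backbone layer): the running state decides
    how much of each new layer to let in."""
    def __init__(self, dim=256):
        super().__init__()
        self.input_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, layer_feats):
        # layer_feats: (num_layers, B, dim), ordered shallow -> deep
        state = torch.zeros_like(layer_feats[0])
        for feat in layer_feats:
            joint = torch.cat([state, feat], dim=-1)
            gate = torch.sigmoid(self.input_gate(joint))
            state = (1 - gate) * state + gate * torch.tanh(self.candidate(joint))
        return state   # fused multi-layer representation for retrieval

fuse = GatedLayerRecurrence()
print(fuse(torch.rand(12, 4, 256)).shape)   # torch.Size([4, 256])
```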

Authors:David Stotko, Reinhard Klein
Title: SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
Abstract:
The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. We also introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a moderate runtime of 30 minutes per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.
中文: 本文提出了一种新颖方法,仅使用单目RGB视频即可实现织物的三维动态场景重建,通过物理模拟和可微分渲染将几何重建与外观估计相结合,从而获得高质量结果。
English: This paper introduces a novel method for 3D dynamic scene reconstruction of fabrics using a single monocular RGB video, combining geometry reconstruction with appearance estimation through physical simulation and differentiable rendering to achieve high-quality results.

Authors:Lena Wild, Rafael Valencia, Patric Jensfelt
Title: ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
Abstract:
Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://kth-rpl.github.io/ArgoTweak/.

Authors:Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, Peter Stone
Title: SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
Abstract:
Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub.

Authors:Marius Dähling, Sebastian Krebs, J. Marius Zöllner
Title: CrowdQuery: Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes
Abstract:
This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific transformer models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at https://github.com/mdaehl/CrowdQuery.
中文: 本文提出CrowdQuery方法,通过将包含边界框尺寸的物体密度图嵌入对象查询中,有效提升了基于Transformer的检测器在拥挤场景中的2D和3D检测性能,并在多个数据集上实现显著改进。
English: This paper presents CrowdQuery, a novel method that enhances transformer-based detectors by integrating object density maps with bounding box dimensions into object queries, significantly improving 2D and 3D crowd detection performance across multiple datasets.
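
One way to picture density-guided queries is sketched below: a density map is predicted from the encoder features, embedded to the query dimension, sampled at each query's reference point, and added to the query content before decoding. The heads, the sampling scheme, and the shapes are assumptions made for illustration; the exact design of the CQ module differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DensityGuidedQueries(nn.Module):
    """Sketch of density guidance for DETR-style detectors: a predicted crowd
    density map is embedded and sampled at each query's normalised reference
    point, then added to the query content."""
    def __init__(self, dim=256):
        super().__init__()
        self.density_head = nn.Conv2d(dim, 1, 1)     # density map from features
        self.embed = nn.Conv2d(1, dim, 1)            # lift density to query dim

    def forward(self, feat_map, queries, ref_points):
        # feat_map: (B, dim, H, W); queries: (B, Q, dim); ref_points: (B, Q, 2) in [0, 1]
        density = torch.sigmoid(self.density_head(feat_map))
        dens_emb = self.embed(density)                           # (B, dim, H, W)
        grid = ref_points * 2 - 1                                # to [-1, 1] for grid_sample
        sampled = F.grid_sample(dens_emb, grid[:, :, None, :],
                                align_corners=False)             # (B, dim, Q, 1)
        return queries + sampled.squeeze(-1).transpose(1, 2)

module = DensityGuidedQueries()
out = module(torch.rand(2, 256, 32, 32), torch.rand(2, 100, 256), torch.rand(2, 100, 2))
print(out.shape)   # torch.Size([2, 100, 256])
```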

Authors:Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei
Title: BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
Abstract:
As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.
中文: 提出的轻量级BcQLM框架通过紧凑型BreezeCLIP编码器,仅用12亿参数就实现了与标准多模态模型相当的性能,为资源受限环境提供了高效的部署方案。
English: The proposed lightweight BcQLM framework with its compact BreezeCLIP encoder achieves performance comparable to standard multimodal models while using only 1.2 billion parameters, offering an efficient solution for deployment in resource-constrained environments.

Authors:Stefan Podgorski, Sourav Garg, Mehdi Hosseinzadeh, Lachlan Mares, Feras Dayoub, Ian Reid
Title: TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals
Abstract:
Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory using monocular depth and traversability estimation, and incorporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: https://github.com/podgorki/TANGO.
中文摘要:本研究提出了一种仅使用RGB图像的物体级拓扑导航系统,无需3D地图或预训练控制器即可实现零样本长距离机器人导航,通过全局路径规划与局部轨迹控制的结合,在开放环境中展现出优于现有方法的适应性和有效性。
English Summary: This study introduces a novel RGB-only, object-level topometric navigation system that enables zero-shot, long-range robot navigation without relying on 3D maps or pre-trained controllers, outperforming existing methods through integrated global planning and local control with open-set applicability.

Authors:Zhen Tian, Christos Anagnostopoulos, Qiyuan Wang, Zhiwei Gao
Title: Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided Framework
Abstract:
Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.
中文: 本文提出Robust U-Net框架,通过HSV颜色监督和多模态约束改进了卫星图像中的海岸线水域分割,实现了更好的训练稳定性和分割质量。
English: This paper introduces Robust U-Net, a framework that enhances coastal water segmentation in satellite imagery through HSV color supervision and multi-modal constraints, achieving improved training stability and segmentation quality.
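
One plausible reading of HSV-guided color supervision is sketched below: alongside the usual mask loss, the mean saturation and value of pixels the model labels as water are pulled toward those of ground-truth water pixels (hue is omitted to keep the sketch short). The weighting factor and this particular formulation are assumptions, not the authors' loss.

```python
import torch
import torch.nn.functional as F

def rgb_to_sv(img):
    """Saturation and value channels of an RGB image in [0, 1]; hue is omitted
    here to keep the sketch compact."""
    v, _ = img.max(dim=1)                       # (B, H, W) value
    mn, _ = img.min(dim=1)
    s = (v - mn) / (v + 1e-6)                   # (B, H, W) saturation
    return s, v

def hsv_guided_loss(pred_logits, gt_mask, img, lam=0.1):
    """Binary segmentation loss plus a color term: the mean saturation/value of
    pixels the model calls water should match those of labelled water pixels.
    The weight lam is made up for illustration."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
    prob = torch.sigmoid(pred_logits)
    s, v = rgb_to_sv(img)

    def masked_mean(x, w):
        return (x * w).sum() / (w.sum() + 1e-6)

    color = (masked_mean(s, prob) - masked_mean(s, gt_mask)).abs() + \
            (masked_mean(v, prob) - masked_mean(v, gt_mask)).abs()
    return bce + lam * color

img = torch.rand(2, 3, 64, 64)
pred = torch.randn(2, 64, 64)
gt = (torch.rand(2, 64, 64) > 0.5).float()
print(hsv_guided_loss(pred, gt, img).item())
```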

Authors:Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, Chunchao Guo
Title: X-Part: High-Fidelity and Structure-Coherent Shape Decomposition
Abstract:
Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and struggle to produce semantically meaningful decompositions. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits bounding boxes as prompts for part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Code will be released for public research.

Authors:Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu
Title: HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Abstract:
Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of coordinating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllability across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
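
The time-adaptive Classifier-Free Guidance mentioned above can be sketched as guidance weights that change with the denoising step; the example below decays the text weight while ramping up the audio weight toward the late steps. The linear schedule, the endpoint values, and the two-condition decomposition are illustrative assumptions, not HuMo's actual strategy.

```python
import torch

def time_adaptive_cfg(eps_uncond, eps_text, eps_audio, step, num_steps,
                      w_text=(7.5, 3.0), w_audio=(1.0, 4.0)):
    """Classifier-free guidance with per-condition weights that vary along the
    denoising trajectory: text guidance linearly decays while audio guidance
    ramps up toward the late (detail-refining) steps."""
    t = step / max(num_steps - 1, 1)                 # 0 at the first step, 1 at the last
    wt = w_text[0] + (w_text[1] - w_text[0]) * t
    wa = w_audio[0] + (w_audio[1] - w_audio[0]) * t
    return (eps_uncond
            + wt * (eps_text - eps_uncond)
            + wa * (eps_audio - eps_uncond))

eps = [torch.randn(1, 4, 32, 32) for _ in range(3)]   # dummy noise predictions
print(time_adaptive_cfg(*eps, step=10, num_steps=50).shape)
```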

Authors:Piyush Bagad, Andrew Zisserman
Title: Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
Abstract:
Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as "opening vs. closing a door", "approaching vs. moving away from something", "folding vs. unfolding paper", etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count, etc.), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time-aware video representations which offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder whose latent space has an inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charades. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.
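
A compact way to express a perceptual-straightening inductive bias is a curvature penalty on the latent trajectory of a clip: consecutive displacement vectors in latent space should point in nearly the same direction. The sketch below implements such a penalty; it is an illustration of the bias, not the paper's training objective.

```python
import torch

def straightening_penalty(latents, eps=1e-6):
    """Encourage the latent trajectory of a clip to be straight: penalise the
    angle between consecutive displacement vectors (cosine close to 1 means a
    straight path). latents: (T, D) per-frame codes."""
    d = latents[1:] - latents[:-1]                     # (T-1, D) displacements
    d = d / (d.norm(dim=-1, keepdim=True) + eps)       # unit directions
    cos = (d[1:] * d[:-1]).sum(dim=-1)                 # (T-2,) consecutive cosines
    return (1 - cos).mean()

traj = torch.cumsum(torch.randn(16, 32), dim=0)        # a wiggly 16-frame trajectory
print(straightening_penalty(traj).item())
```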

Authors:Rongsheng Wang, Fenghe Tang, Qingsong Yao, Rui Yan, Xu Zhang, Zhen Huang, Haoran Lai, Zhiyang He, Xiaodong Tao, Zihang Jiang, Shaohua Kevin Zhou
Title: SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training
Abstract:
Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.
中文: SimCroP框架通过相似性驱动对齐和跨粒度融合,提升了胸部CT扫描的医学视觉语言预训练能力,能更有效地解析稀疏病灶和复杂报告关系,在下游任务中表现卓越。
English: The SimCroP framework enhances medical vision-language pre-training for chest CT scans by combining similarity-driven alignment and cross-granularity fusion to better interpret sparse lesions and complex report relationships, achieving superior performance in downstream tasks.

Authors:Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang
Title: Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection
Abstract:
Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. At last, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at https://github.com/gyl2565309278/DTH-CP.
Chinese: 本文提出了一种新颖的弱监督目标检测框架,通过设计热图引导的候选框选择器、增强背景类别表示的检测网络和负确定性监督损失,有效解决了现有方法的三个关键缺陷,在PASCAL VOC数据集上取得了领先性能。
English: This paper introduces a novel weakly supervised object detection framework that addresses key limitations in existing methods by proposing a heatmap-guided proposal selector, an enhanced detection network with background class representation, and a negative certainty supervision loss, achieving state-of-the-art performance on PASCAL VOC datasets.
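
The dual-threshold idea can be illustrated as follows: a low threshold on a class heatmap proposes whole-object extents, and only those regions that also contain a high-threshold peak are kept, which helps cover the full object while discarding weak background blobs. The thresholds, the connected-component formulation, and the bounding-box output are assumptions for illustration; the HGPS algorithm itself operates on proposals.

```python
import numpy as np
from scipy import ndimage

def dual_threshold_regions(heatmap, t_high=0.7, t_low=0.3):
    """Dual thresholding of a class heatmap: low-threshold connected components
    give candidate object extents, but only components containing at least one
    high-threshold peak are kept. Thresholds are illustrative."""
    low_mask = heatmap >= t_low
    labels, n = ndimage.label(low_mask)
    boxes = []
    for idx in range(1, n + 1):
        region = labels == idx
        if heatmap[region].max() < t_high:      # no confident peak -> drop region
            continue
        ys, xs = np.nonzero(region)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes   # pseudo-GT extents; proposals overlapping them would be pre-selected

heat = np.zeros((64, 64))
heat[10:20, 10:25] = 0.9        # one strong instance
heat[40:50, 40:55] = 0.4        # weak blob without a confident peak
print(dual_threshold_regions(heat))   # only the first region survives
```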

Authors:Long Gao, Yunhe Zhang, Yan Jiang, Weiying Xie, Yunsong Li
Title: Hyperspectral Mamba for Hyperspectral Object Tracking
Abstract:
Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba) is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset. The code will be released at https://github.com/lgao001/HyMamba.
中文摘要:提出的HyMamba网络通过状态空间模块统一了光谱、跨深度和时间建模,在基准数据集上实现了最先进的性能,推动了高光谱目标跟踪的发展。
English Summary: The proposed HyMamba network advances hyperspectral object tracking by unifying spectral, cross-depth, and temporal modeling through state space modules, achieving state-of-the-art performance on benchmark datasets.

Authors:Seongho Kim, Sejong Ryu, Hyoukjun You, Je Hyeong Hong
Title: GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation
Abstract:
Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at https://github.com/ta-ho/GTA-Crime.
中文: GTA-Crime数据集和生成框架利用《侠盗猎车手5》的合成数据解决了视频中罕见致命事件检测的难题,并通过领域自适应技术提升了在真实场景中的检测准确性。
English: The GTA-Crime dataset and generation framework address the challenge of detecting rare fatal incidents in videos by using synthetic data from Grand Theft Auto 5, enhanced with domain adaptation to improve real-world detection accuracy.
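
The snippet-level alignment described above follows the standard Wasserstein adversarial recipe. The sketch below illustrates that recipe on precomputed snippet features; the critic architecture, the 2048-d feature dimension, and the weight-clipping constant are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

# Hypothetical critic network; the paper's exact architecture is not specified here.
critic = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 1))

def critic_loss(syn_feats, real_feats):
    # The critic maximizes the score gap between real (UCF-Crime) and
    # synthetic (GTA-Crime) snippets, approximating the Wasserstein-1 distance.
    return critic(syn_feats).mean() - critic(real_feats).mean()

def extractor_loss(syn_feats):
    # The feature extractor is trained adversarially so that synthetic
    # snippet features become indistinguishable from real ones.
    return -critic(syn_feats).mean()

def clip_critic(c=0.01):
    # Weight clipping enforces the 1-Lipschitz constraint (original WGAN recipe).
    for p in critic.parameters():
        p.data.clamp_(-c, c)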

Authors:Jingjing Liu, Yinchao Han, Xianchao Xiu, Jianhua Zhang, Wanquan Liu
Title: Lightweight Deep Unfolding Networks with Enhanced Robustness for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) is one of the key techniques in image processing. Although deep unfolding networks (DUNs) have demonstrated promising performance in ISTD due to their model interpretability and data adaptability, existing methods still face significant challenges in parameter efficiency and noise robustness. In this regard, we propose a highly lightweight framework based on robust principal component analysis (RPCA) called L-RPCANet. Technically, a hierarchical bottleneck structure is constructed that first reduces and then restores the channel dimension of the single-channel input infrared image to achieve channel-wise feature refinement, with bottleneck layers designed in each module to extract features. This reduces the number of channels used in feature extraction and makes the network parameters more lightweight. Furthermore, a noise reduction module is embedded to enhance the robustness against complex noise. In addition, squeeze-and-excitation networks (SENets) are leveraged as a channel attention mechanism to focus on the varying importance of different features across channels, thereby achieving excellent performance while remaining both lightweight and robust. Extensive experiments on the ISTD datasets validate the superiority of our proposed method compared with state-of-the-art methods including RPCANet, DRPCANet, and RPCANet++. The code will be available at https://github.com/xianchaoxiu/L-RPCANet.
Chinese: 本文提出L-RPCANet,一种基于鲁棒主成分分析的高度轻量化框架,通过分层瓶颈结构、降噪模块和通道注意力机制,在红外小目标检测中实现了参数轻量化和噪声鲁棒性的显著提升。
English: This paper introduces L-RPCANet, a highly lightweight and robust framework for infrared small target detection that enhances parameter efficiency and noise resilience through hierarchical bottleneck structures, a noise reduction module, and channel attention mechanisms.

Authors:Sasan Sharifipour, Constantino Álvarez Casado, Mohammad Sabokrou, Miguel Bordallo López
Title: APML: Adaptive Probabilistic Matching Loss for Robust 3D Point Cloud Reconstruction
Abstract:
Training deep learning models for point cloud prediction tasks such as shape completion and generation depends critically on loss functions that measure discrepancies between predicted and ground-truth point sets. Commonly used functions such as Chamfer Distance (CD), HyperCD, and InfoCD rely on nearest-neighbor assignments, which often induce many-to-one correspondences, leading to point congestion in dense regions and poor coverage in sparse regions. These losses also involve non-differentiable operations due to index selection, which may affect gradient-based optimization. Earth Mover Distance (EMD) enforces one-to-one correspondences and captures structural similarity more effectively, but its cubic computational complexity limits its practical use. We propose the Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching that leverages Sinkhorn iterations on a temperature-scaled similarity matrix derived from pairwise distances. We analytically compute the temperature to guarantee a minimum assignment probability, eliminating manual tuning. APML achieves near-quadratic runtime, comparable to Chamfer-based losses, and avoids non-differentiable operations. When integrated into state-of-the-art architectures (PoinTr, PCN, FoldingNet) on ShapeNet benchmarks and on a spatiotemporal Transformer (CSI2PC) that generates 3D human point clouds from WiFi CSI measurements, APML yields faster convergence, superior spatial distribution, especially in low-density regions, and improved or on-par quantitative performance without additional hyperparameter search. The code is available at: https://github.com/apm-loss/apml.
中文摘要:提出的自适应概率匹配损失(APML)通过可微分且计算高效的近似一对一匹配方法,克服了现有点云损失函数的局限性,在多种架构和基准测试中实现了更优性能。
English Summary: The proposed Adaptive Probabilistic Matching Loss (APML) overcomes limitations of existing point cloud loss functions by providing a differentiable, computationally efficient approximation of one-to-one matching, achieving superior performance across various architectures and benchmarks.
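
A minimal sketch of a Sinkhorn-based soft one-to-one matching loss in the spirit of APML. The fixed temperature `tau` here is a stand-in for the paper's analytic temperature computation, which this sketch does not reproduce.

import torch

def sinkhorn_matching_loss(pred, gt, tau=0.05, iters=20):
    # pred, gt: (N, 3) point sets.
    d = torch.cdist(pred, gt)                  # (N, N) pairwise distances
    log_k = -d / tau                           # temperature-scaled similarities
    log_u = torch.zeros(d.shape[0], device=d.device)
    log_v = torch.zeros(d.shape[1], device=d.device)
    for _ in range(iters):                     # Sinkhorn normalization in log space
        log_u = -torch.logsumexp(log_k + log_v[None, :], dim=1)
        log_v = -torch.logsumexp(log_k + log_u[:, None], dim=0)
    log_p = log_k + log_u[:, None] + log_v[None, :]   # near-doubly-stochastic plan
    return (log_p.exp() * d).sum() / d.shape[0]       # expected matching cost

Because every operation above is a smooth tensor operation, the loss is fully differentiable, avoiding the index-selection issue of Chamfer-style losses.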

Authors:Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park, Junho Kim, Joonseok Lee, Seongsu Ha, Byung-Hoon Kim
Title: Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Abstract:
Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
中文: 视频并行扩展(VPS)是一种推理时方法,通过并行处理视频中互不重叠的帧子集并整合输出结果,在不增加计算成本或额外训练的情况下,有效提升了视频大语言模型的时序推理能力。
English: Video Parallel Scaling (VPS) is an inference-time method that enhances VideoLLMs' temporal reasoning by processing disjoint frame subsets in parallel streams and aggregating their outputs, effectively improving performance without increasing computational costs or requiring additional training.
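
The aggregation step of VPS can be sketched in a few lines. The snippet assumes a `model` callable that returns next-token logits for a frame subset and a prompt; the interleaved split is just one simple way to form disjoint subsets.

import torch

def vps_next_token_logits(model, frames, prompt, num_streams=4):
    # Run the model on disjoint frame subsets and aggregate the output
    # probabilities across the parallel streams.
    probs = []
    for s in range(num_streams):
        subset = frames[s::num_streams]          # disjoint, interleaved subset
        logits = model(subset, prompt)           # assumed (vocab,) logits
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0).log()  # aggregate in probability space

Each stream keeps the per-stream context short, so perceptual bandwidth grows with the number of streams while the context window of any single pass stays fixed.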

Authors:Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu
Title: 3D and 4D World Modeling: A Survey
Abstract:
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey.
中文总结:本综述首次系统梳理了3D与4D世界建模研究,通过建立明确定义和分类框架,并系统总结专用数据集与评估指标,填补了该领域的研究空白。
English Summary: This survey provides the first comprehensive review of 3D and 4D world modeling, establishing clear definitions and a structured taxonomy while systematically analyzing datasets and evaluation metrics to address current research gaps.

Authors:Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
Title: Visual Representation Alignment for Multimodal Large Language Models
Abstract:
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.

Authors:Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, Hao Zhao
Title: One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation
Abstract:
Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: https://gzwsama.github.io/OnePoseviaGen.github.io/

Authors:Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
Title: Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Abstract:
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning, spanning tens of steps, and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
Chinese: 近期大型多模态模型的进展使Mini-o3系统通过数十步的深度多轮推理,在复杂视觉搜索任务中实现最优性能,解决了现有方法推理模式单一和交互轮次有限的问题。
English: Recent advances in large multimodal models have enabled Mini-o3 to achieve state-of-the-art performance on challenging visual search tasks through deep, multi-turn reasoning spanning tens of steps, addressing limitations of monotonous reasoning and limited interaction turns in existing approaches.

Authors:Boammani Aser Lompo, Marc Haraoui
Title: Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Abstract:
Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
Chinese: 针对表格图像视觉推理基准的不足,我们推出了Visual-TableQA——通过经济高效的自动生成流程构建的大规模多模态数据集,有效提升了模型对复杂表格数据的推理能力。
English: To address limitations in visual reasoning benchmarks for table images, we introduce Visual-TableQA, a large-scale multimodal dataset generated through a cost-effective, autonomous pipeline that enhances model performance on complex tabular data.

Authors:Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki
Title: Feature Space Analysis by Guided Diffusion Model
Abstract:
One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee, which past studies lack, our decoder allows us to identify which of various image attributes are encoded into the user-specified feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP's image encoder, ResNet-50, and a vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs' feature spaces.
中文: 本文提出一种基于引导扩散的解码器,能生成与用户指定特征高度匹配的图像,无需额外训练即可分析不同深度神经网络的特征空间,揭示其编码的图像属性。
English: This paper introduces a decoder using guided diffusion to generate images matching user-specified features, enabling analysis of DNN feature spaces without additional training and revealing encoded image attributes.
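
The guidance described above can be sketched as a single reverse step. Note that `diffusion.estimate_x0`, `diffusion.step`, and `feat_net` are assumed helper interfaces for a generic sampler, not a specific library API.

import torch

def guided_step(x_t, t, diffusion, feat_net, target_feat, scale=1.0):
    # One reverse-diffusion step, nudged toward a user-specified feature.
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = diffusion.estimate_x0(x_t, t)           # clean-image estimate at step t
    dist = (feat_net(x0_hat) - target_feat).norm()   # Euclidean feature distance
    grad = torch.autograd.grad(dist, x_t)[0]         # guidance gradient
    with torch.no_grad():
        return diffusion.step(x_t, t) - scale * grad # guided update of the sample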

Authors:Maja Schlereth, Moritz Schillinger, Katharina Breininger
Title: Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss
Abstract:
Acquiring images in high resolution is often a challenging task. Especially in the medical sector, image quality has to be balanced with acquisition time and patient comfort. To strike a compromise between scan time and quality for Magnetic Resonance (MR) imaging, two anisotropic scans with different low-resolution (LR) orientations can be acquired. Typically, LR scans are analyzed individually by radiologists, which is time consuming and can lead to inaccurate interpretation. To tackle this, we propose a novel approach for fusing two orthogonal anisotropic LR MR images to reconstruct anatomical details in a unified representation. Our multi-view neural network is trained in a self-supervised manner, without requiring corresponding high-resolution (HR) data. To optimize the model, we introduce a sparse coordinate-based loss, enabling the integration of LR images with arbitrary scaling. We evaluate our method on MR images from two independent cohorts. Our results demonstrate comparable or even improved super-resolution (SR) performance compared to state-of-the-art (SOTA) self-supervised SR methods for different upsampling scales. By combining a patient-agnostic offline and a patient-specific online phase, we achieve a substantial speed-up of up to ten times for patient-specific reconstruction while achieving similar or better SR quality. Code is available at https://github.com/MajaSchle/tripleSR.
中文摘要:本文提出一种自监督神经网络,通过融合两个正交低分辨率磁共振扫描来重建高分辨率解剖细节,在实现与先进方法相当或更优超分辨率性能的同时,将患者特异性重建速度最高提升十倍。
English Summary: This paper introduces a self-supervised neural network that fuses two orthogonal low-resolution MR scans to reconstruct high-resolution anatomical details, achieving comparable or superior super-resolution performance with up to ten times faster patient-specific reconstruction than state-of-the-art methods.

Authors:Hugo Blanc, Jean-Emmanuel Deschaud, Alexis Paljic
Title: RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis
Abstract:
RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. Project page with videos and code: https://raygaussx.github.io/.

Authors:Yimin Pan, Matthias Nießner, Tobias Kirschstein
Title: HairGS: Hair Strand Reconstruction based on 3D Gaussian Splatting
Abstract:
Human hair reconstruction is a challenging problem in computer vision, with growing importance for applications in virtual reality and digital human modeling. Recent advances in 3D Gaussian Splatting (3DGS) provide efficient and explicit scene representations that naturally align with the structure of hair strands. In this work, we extend the 3DGS framework to enable strand-level hair geometry reconstruction from multi-view images. Our multi-stage pipeline first reconstructs detailed hair geometry using a differentiable Gaussian rasterizer, then merges individual Gaussian segments into coherent strands through a novel merging scheme, and finally refines and grows the strands under photometric supervision. While existing methods typically evaluate reconstruction quality at the geometric level, they often neglect the connectivity and topology of hair strands. To address this, we propose a new evaluation metric that serves as a proxy for assessing topological accuracy in strand reconstruction. Extensive experiments on both synthetic and real-world datasets demonstrate that our method robustly handles a wide range of hairstyles and achieves efficient reconstruction, typically completing within one hour. The project page can be found at: https://yimin-pan.github.io/hair-gs/

Authors:Chunhang Zheng, Zichang Ren, Dou Li
Title: SEEC: Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression
Abstract:
Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation-Assisted Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Specifically, SEEC first extracts image features and then applies semantic segmentation to identify different regions, each assigned a specialized entropy model to better capture its unique statistical properties. Finally, a multi-channel discrete logistic mixture likelihood is employed to model the pixel value distributions effectively. Experimental results on benchmark datasets demonstrate that SEEC achieves state-of-the-art compression ratios while introducing only minimal encoding and decoding latency. Beyond its superior performance, the proposed model also supports Regions of Interest (ROIs) coding conditioned on a provided segmentation mask. Our code is available at https://github.com/chunbaobao/SEEC.
中文:提出的SEEC框架通过语义分割为不同图像区域应用多个专用熵模型,以最低延迟实现了最先进的无损图像压缩性能。
English: The proposed SEEC framework enhances lossless image compression by using semantic segmentation to apply multiple specialized entropy models for different image regions, achieving state-of-the-art compression ratios with minimal latency.
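
The region-to-model routing at the heart of SEEC can be sketched as below. The linear heads standing in for entropy models, the interfaces, and the shapes are all assumptions; in the paper each model parameterizes a discrete logistic mixture.

import torch
import torch.nn as nn

class MultiEntropyRouter(nn.Module):
    # Route each pixel's context features to a region-specific entropy model
    # chosen by its segmentation label (sketch of the SEEC idea).
    def __init__(self, num_regions, ctx_dim, num_mix_params):
        super().__init__()
        self.models = nn.ModuleList(
            nn.Linear(ctx_dim, num_mix_params) for _ in range(num_regions)
        )

    def forward(self, ctx, seg):
        # ctx: (N, ctx_dim) context features; seg: (N,) region labels.
        out = torch.zeros(ctx.shape[0], self.models[0].out_features,
                          device=ctx.device)
        for r, m in enumerate(self.models):
            mask = seg == r
            if mask.any():
                out[mask] = m(ctx[mask])  # parameters of that region's likelihood
        return out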

Authors:Sung Ju Lee, Nam Ik Cho
Title: Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
Abstract:
Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. In conclusion, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code is available at https://github.com/thomas11809/SFWMark.
中文摘要:提出的埃尔米特对称傅里叶水印(SFW)方法通过保持频率完整性和抗裁剪攻击的中心感知嵌入策略,显著提升了语义水印的鲁棒性,在各种攻击场景下实现了最优的检测性能与图像保真度。
English Summary: The proposed Hermitian Symmetric Fourier Watermarking (SFW) method with center-aware embedding enhances semantic watermarking by preserving frequency integrity and resisting cropping attacks, achieving state-of-the-art robustness and image fidelity across various attack scenarios.
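
A NumPy sketch of the Hermitian symmetry constraint that keeps the inverse FFT real-valued, which is the frequency-integrity property SFW relies on. Where and how the watermark message is embedded is not shown and would be scheme-specific.

import numpy as np

def hermitian_symmetrize(spec):
    # Enforce F(-u, -v) = conj(F(u, v)) so that ifft2(spec) is real.
    sym = np.flip(np.conj(spec), axis=(0, 1))
    sym = np.roll(sym, shift=(1, 1), axis=(0, 1))  # align the DC bin
    return 0.5 * (spec + sym)

rng = np.random.default_rng(0)
spec = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
x = np.fft.ifft2(hermitian_symmetrize(spec))
print(np.abs(x.imag).max())  # ~0: the spatial signal is (numerically) real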

Authors:Peijin Xie, Shun Qian, Bingquan Liu, Dexin Wang, Lin Sun, Xiangzheng Zhang
Title: TextlessRAG: End-to-End Visual Document RAG by Speech Without Text
Abstract:
Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and more flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech-document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at https://github.com/xiepeijinhit-hue/textlessrag.
中文摘要:TextlessRAG是首个实现语音直接对文档图像进行问答的端到端框架,通过布局感知重排机制提升检索效果,在消除文本转换步骤的同时显著提高了效率与准确性。
English Summary: TextlessRAG is the first end-to-end framework that enables speech-based question answering directly on document images, eliminating traditional text conversion steps while improving efficiency and accuracy through layout-aware retrieval refinement.

Authors:Kiet T. Nguyen, Chanhuyk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong
Title: Universal Few-Shot Spatial Control for Diffusion Models
Abstract:
Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.
中文摘要:本文提出的通用少样本控制适配器(UFC)能让预训练扩散模型仅需少量标注数据即可快速适应新型空间控制任务,在多种架构上均达到与全监督方法相当的性能表现。
English Summary: The proposed Universal Few-Shot Control (UFC) adapter enables pretrained diffusion models to quickly adapt to new spatial control tasks using minimal training data, achieving performance comparable to fully supervised methods across various architectures.

Authors:Saad Lahlali, Alexandre Fournier Montgieux, Nicolas Granger, Hervé Le Borgne, Quoc Cuong Pham
Title: MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection
Abstract:
Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: the Teacher network learns from single viewpoints, but its targets are derived from temporally aggregated static objects. The Teacher then generates high-quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available in our public repository: https://github.com/CEA-LIST/MVAT.
中文: MVAT是一种新颖的框架,利用时序多视角数据和师生蒸馏范式,无需3D标注即可实现最先进的弱监督3D物体检测。
English: MVAT is a novel framework that utilizes temporal multi-view data and a Teacher-Student distillation paradigm to achieve state-of-the-art weakly supervised 3D object detection without requiring 3D annotations.
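
One plausible form of the multi-view 2D projection loss is sketched below: project the corners of a predicted 3D box into each annotated view and penalize the gap to the 2D box. The corner projection and the L1 comparison are assumptions, not the paper's exact formulation.

import torch

def project(points, P):
    # Project (N, 3) world points with a (3, 4) camera matrix.
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def multiview_projection_loss(corners3d, cams, boxes2d):
    # corners3d: (8, 3) predicted box corners; cams: list of (3, 4) matrices;
    # boxes2d: list of (x1, y1, x2, y2) annotations, one per view.
    loss = 0.0
    for P, box in zip(cams, boxes2d):
        uv = project(corners3d, P)               # (8, 2) projected corners
        pred = torch.cat([uv.min(dim=0).values, uv.max(dim=0).values])
        loss = loss + (pred - box).abs().mean()  # L1 gap to the 2D annotation
    return loss / len(cams)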

Authors:Wenshuo Gao, Xicheng Lan, Luyao Zhang, Shuai Yang
Title: LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors
Abstract:
Vector graphics, known for their scalability and user-friendliness, provide a unique approach to visual content compared to traditional pixel-based images. Animation of these graphics, driven by the motion of their elements, offers enhanced comprehensibility and controllability but often requires substantial manual effort. To automate this process, we propose a novel method that integrates implicit neural representations with text-to-video diffusion models for vector graphic animation. Our approach employs layered implicit neural representations to reconstruct vector graphics, preserving their inherent properties such as infinite resolution and precise color and shape constraints, which effectively bridges the large domain gap between vector graphics and diffusion models. The neural representations are then optimized using video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. Finally, the vector graphics are warped to match the representations, resulting in smooth animation. Experimental results validate the effectiveness of our method in generating vivid and natural vector graphic animations, demonstrating significant improvement over existing techniques that suffer from limitations in flexibility and animation quality.

Authors:Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Title: MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification
Abstract:
Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch's diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance of EfficientNet-B0 (AUROC 0.907 vs. 0.908) while substantially improving interpretability, achieving higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. We make the code publicly available: https://github.com/TruhnLab/MedicalPatchNet
中文:MedicalPatchNet是一种自解释性胸部X光分类模型,在保持与EfficientNet-B0相当性能的同时,通过将决策透明归因于特定图像区域而显著提升可解释性,且无需后处理技术。
English: MedicalPatchNet is a self-explainable chest X-ray classification model that matches EfficientNet-B0's performance while significantly improving interpretability by transparently attributing decisions to specific image regions without post-hoc methods.
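
The split-classify-aggregate pattern is simple enough to sketch. The backbone, the patch size, and mean aggregation below are placeholders, not the paper's exact configuration.

import torch
import torch.nn as nn

class PatchNet(nn.Module):
    # Classify non-overlapping patches independently and average the
    # per-patch logits; the per-patch logits double as an attribution map.
    def __init__(self, backbone, patch=32):
        super().__init__()
        self.backbone, self.patch = backbone, patch

    def forward(self, x):                                 # x: (B, C, H, W)
        p = self.patch
        patches = x.unfold(2, p, p).unfold(3, p, p)       # (B, C, h, w, p, p)
        B, C, h, w, _, _ = patches.shape
        flat = patches.permute(0, 2, 3, 1, 4, 5).reshape(B * h * w, C, p, p)
        logits = self.backbone(flat).view(B, h * w, -1)   # per-patch scores
        return logits.mean(dim=1), logits  # image-level logits + patch evidence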

Authors:Wenshuo Gao, Xicheng Lan, Shuai Yang
Title: ANYPORTAL: Zero-Shot Consistent Video Background Replacement
Abstract:
Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

Authors:Pooya Khosravi, Kun Han, Anthony T. Wu, Arghavan Rezvani, Zexin Feng, Xiaohui Xie
Title: XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning
Abstract:
Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. Our experiments on the OCTA-500 dataset demonstrate XOCT's improvements, especially for the en-face projections, which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at https://github.com/uci-cbcl/XOCT.
中文: 提出的XOCT深度学习框架结合跨维度监督和多尺度特征融合,优化了OCTA成像中分层感知的血管重建,提升了视网膜疾病诊断的可靠性与应用普及性。
English: The proposed XOCT deep learning framework integrates cross-dimensional supervision and multi-scale feature fusion to enhance layer-aware vascular reconstruction in OCTA imaging, improving diagnostic reliability and accessibility for retinal diseases.
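
The segmentation-weighted z-axis averaging used to build the CDS supervisory signal can be sketched directly; the tensor shapes below are assumptions.

import torch

def layer_enface(volume, layer_masks, eps=1e-6):
    # volume: (D, H, W) OCT/OCTA intensities; layer_masks: (L, D, H, W)
    # soft masks, one per retinal layer. Returns (L, H, W) en-face maps.
    num = (layer_masks * volume[None]).sum(dim=1)   # weighted sum over depth
    den = layer_masks.sum(dim=1) + eps              # per-pixel layer thickness
    return num / den                                # layer-wise en-face projection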

Authors:Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Peifeng Ma, Xue Yang, Hongsheng Li
Title: GLEAM: Learning to Match and Explain in Cross-View Geo-Localization
Abstract:
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities, including UAV imagery, street maps, panoramic views, and ground photographs, by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.
中文: GLEAM-C是一个基础性跨视角地理定位模型,通过将无人机图像、街景地图、全景视图和地面照片等多模态数据与卫星图像对齐,实现了与现有方法相当的效率与精度;而GLEAM-X则利用多模态大语言模型引入可解释推理机制,解决了传统地理定位方法缺乏解释性的问题。
English: GLEAM-C is a foundational cross-view geo-localization model that unifies multiple views and modalities by aligning them with satellite imagery, achieving efficiency and accuracy comparable to prior methods, while GLEAM-X introduces explainable reasoning through multimodal large language models to address interpretability in geo-localization.

Authors:Ze-Xin Yin, Jiaxiong Qiu, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie
Title: DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation
Abstract:
The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text- and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: https://zx-yin.github.io/dreamlifting/.
中文: 本文提出了轻量化高斯资产适配器(LGAA),一种利用多视角扩散先验统一几何与PBR材质建模的模块化框架,通过仅需6.9万样本的高效训练即可实现端到端3D资产生成,并提取出高质量可重照明的网格资源。
English: The paper introduces the Lightweight Gaussian Asset Adapter (LGAA), a modular framework that unifies geometry and PBR material modeling using multi-view diffusion priors for end-to-end 3D asset generation, achieving efficient convergence and high-quality results with minimal data.

Authors:Erencem Ozbey, Dimitrios I. Diochnos
Title: Dimensionally Reduced Open-World Clustering: DROWCULA
Abstract:
Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications complicate things further because, no matter how many labels have been identified in a task of interest, examples corresponding to novel classes may still appear in the future. Not surprisingly, prior work in this so-called 'open-world' context has focused largely on semi-supervised approaches. Focusing on image classification, and somewhat paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so both when the number of clusters is known and when it is unknown ahead of time. The code is available at: https://github.com/DROWCULA/DROWCULA.
中文: 本文提出了一种完全无监督的方法,通过结合视觉变换器进行聚类和流形学习优化嵌入,在多个数据集上实现了最先进的图像分类新类别发现效果,无论聚类数量是否已知。
English: This paper introduces a fully unsupervised method for identifying novel classes in image classification by combining Vision Transformers for clustering with manifold learning to enhance embeddings, achieving state-of-the-art results across multiple datasets even when the cluster count is unknown.
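
Estimating the number of clusters from embeddings is a standard model-selection problem. Below is a simple silhouette-based stand-in for that step; the paper's actual estimator may differ, and the embeddings are assumed to be ViT features, optionally manifold-reduced first.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_clusters(embeddings, k_range=range(2, 30)):
    # Pick the cluster count whose k-means partition has the best silhouette.
    best_k, best_s = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        s = silhouette_score(embeddings, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k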

Authors:Yingsheng Wang, Shuo Lu, Jian Liang, Aihua Zheng, Ran He
Title: Frustratingly Easy Feature Reconstruction for Out-of-Distribution Detection
Abstract:
Out-of-distribution (OOD) detection helps models identify data outside the training categories, which is crucial for security applications. While feature-based post-hoc methods address this by evaluating data differences in the feature space without changing network parameters, they often require access to training data, which is unsuitable in scenarios where data privacy protection is a concern. In this paper, we propose a simple yet effective post-hoc method, termed Classifier-based Feature Reconstruction (ClaFR), from the perspective of subspace projection. It first performs an orthogonal decomposition of the classifier's weights to extract the class-known subspace, then maps the original data features into this subspace to obtain new data representations. Subsequently, the OOD score is determined by calculating the feature reconstruction error of the data within the subspace. Compared to existing OOD detection algorithms, our method does not require access to training data while achieving leading performance on multiple OOD benchmarks. Our code is released at https://github.com/Aie0923/ClaFR.
Chinese: 提出的基于分类器的特征重构方法通过子空间投影和特征重构误差,在不访问训练数据的情况下实现分布外检测,既保护了数据隐私又达到了领先性能。
English: The proposed Classifier-based Feature Reconstruction (ClaFR) method enables out-of-distribution detection without accessing training data by utilizing subspace projection and feature reconstruction error, achieving state-of-the-art performance while addressing privacy concerns.
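
A minimal sketch of the subspace-projection reading of ClaFR: build a basis of the classifier's row space, project features into it, and score by reconstruction error. The SVD-based basis and the optional rank truncation `k` are implementation assumptions.

import torch

def clafr_score(features, classifier_weight, k=None):
    # classifier_weight: (num_classes, dim) weights of the final linear layer;
    # its row space spans the class-known subspace.
    _, _, Vh = torch.linalg.svd(classifier_weight, full_matrices=False)
    basis = Vh if k is None else Vh[:k]             # (r, dim) orthonormal rows
    recon = (features @ basis.T) @ basis            # project and map back
    return (features - recon).norm(dim=1)           # higher = more OOD-like

Note that only the classifier's weights are needed, which is why no training data has to be accessed at scoring time.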

Authors:Cedric Caruzzo, Jong Chul Ye
Title: CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis
Abstract:
Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR's design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
中文摘要:为解决批次效应问题,我们开发了基于Transformer的CellPainTR模型,它能学习通用的细胞形态表征,无需重新训练即可实现优异的批次整合和跨数据集泛化能力。
English Summary: To overcome batch effects and enable robust biological discovery, we developed CellPainTR, a Transformer model that learns generalized cellular morphology representations, achieving superior batch integration and out-of-distribution generalization without retraining.

Authors:Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe
Title: H₂OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
Abstract:
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H₂OT), for efficient transformer-based 3D human pose estimation from videos. H₂OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H₂OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
中文: 本文提出的H₂OT分层即插即用框架通过剪枝冗余姿态令牌并恢复完整序列,显著提升了基于视频的3D人体姿态估计效率,在降低计算成本的同时保持高精度。
English: This paper introduces H₂OT, a hierarchical plug-and-play framework that enhances the efficiency of video-based 3D human pose estimation by pruning redundant pose tokens and recovering full sequences, achieving high performance with reduced computational costs.
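
A toy sketch of the prune-then-recover idea. The norm-based token scoring and the linear interpolation below are simple stand-ins for the paper's TPM and TRM, which select and recover tokens in a learned fashion.

import torch
import torch.nn.functional as F

def prune_tokens(tokens, keep):
    # tokens: (B, T, C) per-frame pose tokens; keep the `keep` highest-scoring.
    scores = tokens.norm(dim=-1)                     # (B, T) saliency proxy
    idx = scores.topk(keep, dim=1).indices.sort(dim=1).values  # keep time order
    return tokens.gather(1, idx[..., None].expand(-1, -1, tokens.shape[-1]))

def recover_tokens(tokens, full_len):
    # Expand pruned tokens back to the full temporal resolution.
    x = tokens.transpose(1, 2)                       # (B, C, keep)
    return F.interpolate(x, size=full_len, mode="linear",
                         align_corners=False).transpose(1, 2)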

Authors:Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang
Title: F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Abstract:
Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework which integrates the visual foresight generation into decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.

Authors:Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Title: Interleaving Reasoning for Better Text-to-Image Generation
Abstract:
Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation.
Chinese: 提出的交错推理生成(IRG)框架通过交替进行文本推理与图像合成,有效提升了文本到图像生成中的细节保持与指令遵循能力,并采用两阶段训练方法在多个基准测试中实现了最先进的性能。
English: The proposed Interleaving Reasoning Generation (IRG) framework alternates between text-based reasoning and image synthesis to enhance detail preservation and instruction following in text-to-image generation, achieving state-of-the-art performance across multiple benchmarks through a two-stage training approach.

Authors:Bing Han, Chen Zhu, Dong Han, Rui Yu, Songliang Cao, Jianhui Wu, Scott Chapman, Zijian Wang, Bangyou Zheng, Wei Guo, Marie Weiss, Benoit de Solan, Andreas Hund, Lukas Roth, Kirchgessner Norbert, Andrea Visioni, Yufeng Ge, Wenjuan Li, Alexis Comar, Dong Jiang, Dejun Han, Fred Baret, Yanfeng Ding, Hao Lu, Shouyang Liu
Title: FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data
Abstract:
Vision-driven field monitoring is central to digital agriculture, yet models built on general-domain pretrained backbones often fail to generalize across tasks, owing to the interaction of fine, variable canopy structures with fluctuating field conditions. We present FoMo4Wheat, one of the first crop-domain vision foundation models, pretrained with self-supervision on ImAg4Wheat, the largest and most diverse wheat image dataset to date (2.5 million high-resolution images collected over a decade at 30 global sites, spanning >2,000 genotypes and >500 environmental conditions). This wheat-specific pretraining yields representations that are robust for wheat and transferable to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat models consistently outperform state-of-the-art models pretrained on general-domain datasets. These results demonstrate the value of crop-specific foundation models for reliable in-field perception and chart a path toward a universal crop foundation model with cross-species and cross-task capabilities. FoMo4Wheat models and the ImAg4Wheat dataset are publicly available online: https://github.com/PheniX-Lab/FoMo4Wheat and https://huggingface.co/PheniX-Lab/FoMo4Wheat. The demonstration website is: https://fomo4wheat.phenix-lab.com/.
中文: FoMo4Wheat是首个基于大规模小麦图像数据集ImAg4Wheat预训练的作物领域视觉基础模型,在十项田间任务中表现卓越,不仅对小麦具有强鲁棒性,还能迁移应用于其他作物。
English: FoMo4Wheat is a pioneering crop-domain vision foundation model pretrained on the extensive ImAg4Wheat dataset, demonstrating superior performance across ten in-field tasks and robustness for wheat applications while being transferable to other crops.

Authors:Morteza Kiani Haftlang, Mohammadhossein Malmir, Foroutan Parand, Umberto Michelucci, Safouane El Ghazouali
Title: Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers
Abstract:
Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower while remaining competitive in efficiency. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing redundancy in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with a substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at the GitHub repository: https://github.com/mkianih/Barlow-Swin.
中文: 本文提出了一种轻量级实时医学图像二值分割模型,结合类Swin Transformer编码器与U-Net解码器,通过自监督预训练实现参数量更少、推理更快且精度相当的性能。
English: This paper introduces a lightweight, real-time binary medical image segmentation model that combines a Swin Transformer-like encoder with a U-Net decoder, using self-supervised pretraining to achieve competitive accuracy with fewer parameters and faster inference.
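The Barlow Twins objective mentioned above decorrelates feature dimensions by driving the cross-correlation matrix of two augmented views toward the identity. A minimal PyTorch sketch of that loss (a standard formulation, not the authors' code):

import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    # z1, z2: (N, D) embeddings of two augmented views of the same batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)    # standardize each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    n = z1.shape[0]
    c = z1.T @ z2 / n                              # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum() # pull diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push off-diagonal to 0
    return on_diag + lam * off_diag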

Authors:Matteo Muratori, Joël Seytre
Title: ToonOut: Fine-tuned Background-Removal for Anime Characters
Abstract:
While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: https://github.com/MatteoKartoon/BiRefNet.
Chinese: 本研究针对背景移除模型在动漫风格内容中的表现不佳问题,通过在1,228张标注动漫图像数据集上微调BiRefNet模型,将像素精度从95.3%显著提升至99.5%,并公开了所有相关资源。
English: The study addresses the underperformance of background removal models in anime-style content by fine-tuning the BiRefNet model on a custom dataset of 1,228 annotated images, significantly improving accuracy from 95.3% to 99.5% and releasing all resources publicly.
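The Pixel Accuracy metric is newly introduced by the paper and its exact definition is not spelled out in the abstract; one plausible reading, shown here purely for illustration, is the fraction of pixels whose binarized foreground/background label matches the ground-truth matte:

import numpy as np

def pixel_accuracy(pred_alpha, gt_alpha, thresh=0.5):
    # pred_alpha, gt_alpha: float arrays in [0, 1] of identical shape.
    pred = pred_alpha >= thresh   # binarize predicted matte
    gt = gt_alpha >= thresh       # binarize ground truth
    return (pred == gt).mean()    # fraction of agreeing pixels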

Authors:Simon Pezold, Jérôme A. Kurylec, Jan S. Liechti, Beat P. Müller, Joël L. Lavanchy
Title: Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis
Abstract:
We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model's downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA's embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.
中文: 本研究证明,通过在未标记手术视频上微调V-JEPA基础模型并整合多模态手术室数据流,能显著提升住院时长预测和手术阶段识别等外科数据科学任务的性能。
English: This study demonstrates that finetuning the V-JEPA foundation model on unlabeled surgical videos and integrating multimodal OR data streams significantly enhances performance in surgical data science tasks like length-of-stay prediction and phase recognition.

Authors:Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He
Title: UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Abstract:
Recent advances in image customization show broad application prospects thanks to stronger customization capabilities. However, since humans are especially sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesized and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO
中文摘要:UMO框架通过全局分配优化和强化学习,提升了图像定制中的多身份一致性,有效减少身份混淆,并在多个定制方法中实现了最先进的身份保持效果。
English Summary: The UMO framework enhances image customization by optimizing multi-identity preservation through a global assignment approach and reinforcement learning, significantly reducing identity confusion while maintaining high fidelity across diverse reference images.
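The "multi-to-multi matching" paradigm casts multi-identity generation as a global assignment problem. A minimal sketch of such an assignment, using the Hungarian algorithm from SciPy on a hypothetical identity-similarity matrix (the reward actually optimized by UMO is defined in the paper):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_identities(sim):
    # sim[i, j]: identity similarity between generated face i and reference j.
    rows, cols = linear_sum_assignment(-sim)  # negate: the solver minimizes cost
    return list(zip(rows, cols)), sim[rows, cols].sum()  # matching and total score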

Authors:Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, Chunchao Guo
Title: P3-SAM: Native 3D Part Segmentation
Abstract:
Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods suffer from poor robustness when dealing with complex objects and cannot fully automate the segmentation process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D object into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance. Our project page is available at https://murcherful.github.io/P3-SAM/.
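The abstract mentions an algorithm that automatically selects and merges the masks predicted for each point prompt. Its details are in the paper; purely as an illustration, a greedy IoU-based selection over candidate masks ranked by predicted quality could look like:

import numpy as np

def select_masks(masks, scores, iou_thresh=0.8):
    # masks: list of boolean arrays over the same points; scores: predicted quality.
    order = np.argsort(scores)[::-1]        # best-scoring candidates first
    kept = []
    for i in order:
        duplicate = False
        for k in kept:
            inter = np.logical_and(masks[i], masks[k]).sum()
            union = np.logical_or(masks[i], masks[k]).sum()
            if union > 0 and inter / union > iou_thresh:
                duplicate = True            # overlaps an already-kept mask
                break
        if not duplicate:
            kept.append(i)
    return kept                             # indices of retained part masks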

Authors:Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
Title: D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
Abstract:
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams (text, image, and reasoning) via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
中文摘要:本研究针对网络迷因中的黑色幽默检测,提出了一个新的数据集和增强推理框架,通过整合视觉、文本及自我优化的推理特征,采用三流网络在多项任务中显著超越现有方法。
English Summary: This study introduces a novel dataset and a reasoning-augmented framework for detecting dark humor in online memes, which outperforms existing methods by integrating visual, textual, and self-refined reasoning features through a tri-stream network.
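The Tri-stream Cross-Reasoning Network fuses text, image, and reasoning features via pairwise attention. A condensed PyTorch sketch of one plausible pairwise-attention fusion (dimensions, pooling, and the choice of query stream are assumptions, not the published architecture):

import torch
import torch.nn as nn

class PairwiseFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, text, image, reasoning):       # each: (B, L, dim)
        a, _ = self.t2i(text, image, image)          # text attends to image
        b, _ = self.t2r(text, reasoning, reasoning)  # text attends to reasoning
        fused = torch.cat([text, a, b], dim=-1).mean(dim=1)  # pool over tokens
        return self.proj(fused)                      # (B, dim) unified representation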

Authors:Qing Xu, Wenting Duan, Zhen Chen
Title: Co-Seg: Mutual Prompt-Guided Collaborative Learning for Tissue and Nuclei Segmentation
Abstract:
Histopathology image analysis is critical yet challenging due to the demand for segmenting both tissue regions and nuclei instances for tumor microenvironment and cellular morphology analysis. Existing studies have focused on tissue semantic segmentation or nuclei instance segmentation separately, ignoring the inherent relationship between these two tasks and thus yielding insufficient histopathology understanding. To address this issue, we propose Co-Seg, a framework for collaborative tissue and nuclei segmentation. Specifically, we introduce a novel co-segmentation paradigm that allows the tissue and nuclei segmentation tasks to mutually enhance each other. To this end, we first devise a region-aware prompt encoder (RP-Encoder) to provide high-quality semantic and instance region prompts as prior constraints. Moreover, we design a mutual prompt mask decoder (MP-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, collaboratively computing semantic and instance segmentation masks. Extensive experiments on the PUMA dataset demonstrate that the proposed Co-Seg surpasses state-of-the-art methods in the semantic, instance, and panoptic segmentation of tumor tissues and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg.
中文: 提出的Co-Seg框架通过区域感知提示编码器和互提示掩码解码器实现组织与细胞核的协同分割,在PUMA数据集上取得了最优性能。
English: The proposed Co-Seg framework introduces collaborative tissue and nuclei segmentation through a region-aware prompt encoder and mutual prompt mask decoder, achieving state-of-the-art performance on the PUMA dataset.

Authors:Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, Federico Tombari
Title: CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
Abstract:
Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
中文:CausNVS提出了一种自回归多视角扩散模型,通过精确的相机控制和增强技术,实现了灵活的新视角合成,并在不同设置下保持出色的视觉质量。
English: CausNVS introduces an autoregressive multi-view diffusion model that enables flexible novel view synthesis with precise camera control and strong visual quality across various configurations.

Authors:Jibai Lin, Bo Ma, Yating Yang, Xi Zhou, Rong Ma, Turghun Osman, Ahtamjan Ahmat, Rui Dong, Lei Wang
Title: TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement
Abstract:
Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target image) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for an optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.
中文摘要:TIDE框架通过目标监督的三元组对齐和偏好学习,在无需测试时微调的情况下解决了主题驱动图像生成中身份保持与指令遵循的平衡难题,在多个基准测试中展现出卓越性能。
English Summary: The TIDE framework enhances subject-driven image generation by balancing subject identity preservation with instruction compliance through target-supervised triplet alignment and preference learning, achieving superior performance across multiple benchmarks without test-time fine-tuning.
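Training with paired "winning" and "losing" targets suggests a preference loss in the spirit of Diffusion-DPO; the generic form is sketched below for intuition, and the actual Direct Subject Diffusion objective may differ:

import torch.nn.functional as F

def paired_preference_loss(err_w, err_l, err_w_ref, err_l_ref, beta=0.1):
    # err_*: per-sample denoising errors (e.g., MSE to the target noise) of the
    # trained model and a frozen reference model on winning/losing targets.
    # Lowering the error on the winning target relative to the losing one is rewarded.
    logits = -beta * ((err_w - err_w_ref) - (err_l - err_l_ref))
    return -F.logsigmoid(logits).mean()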

Authors:Zhongxiang Xie, Shuangxi Miao, Yuhan Jiang, Zhewei Zhang, Jing Yao, Xuecao Li, Jianxi Huang, Pedram Ghamisi
Title: FSG-Net: Frequency-Spatial Synergistic Gated Network for High-Resolution Remote Sensing Change Detection
Abstract:
Change detection from high-resolution remote sensing images is a cornerstone of Earth observation applications, yet its efficacy is often compromised by two critical challenges. First, false alarms are prevalent as models misinterpret radiometric variations from temporal shifts (e.g., illumination, season) as genuine changes. Second, a non-negligible semantic gap between deep abstract features and shallow detail-rich features tends to obstruct their effective fusion, culminating in poorly delineated boundaries. To address these issues, we propose the Frequency-Spatial Synergistic Gated Network (FSG-Net), a novel paradigm that aims to systematically disentangle semantic changes from nuisance variations. Specifically, FSG-Net first operates in the frequency domain, where a Discrepancy-Aware Wavelet Interaction Module (DAWIM) adaptively mitigates pseudo-changes by discerningly processing different frequency components. Subsequently, the refined features are enhanced in the spatial domain by a Synergistic Temporal-Spatial Attention Module (STSAM), which amplifies the saliency of genuine change regions. To finally bridge the semantic gap, a Lightweight Gated Fusion Unit (LGFU) leverages high-level semantics to selectively gate and integrate crucial details from shallow layers. Comprehensive experiments on the CDD, GZ-CD, and LEVIR-CD benchmarks validate the superiority of FSG-Net, establishing a new state-of-the-art with F1-scores of 94.16%, 89.51%, and 91.27%, respectively. The code will be made available at https://github.com/zxXie-Air/FSG-Net after a possible publication.
中文摘要:FSG-Net通过频率-空间协同处理机制解决遥感变化检测中的误报和特征融合难题,在多个基准测试中实现了最优性能。
English Summary: The proposed FSG-Net addresses false alarms and feature fusion challenges in remote sensing change detection through frequency-spatial synergistic processing, achieving state-of-the-art performance on multiple benchmarks.

Authors:Yixiao Li, Xin Li, Chris Wei Zhou, Shuo Xing, Hadi Amirpour, Xiaoshuai Hao, Guanghui Yue, Baoquan Zhao, Weide Liu, Xiaoyuan Yang, Zhengzhong Tu, Xinyu Li, Chuanbiao Song, Chenqi Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Xiaoyan Sun, Shishun Tian, Dongyang Yan, Weixia Zhang, Junlin Chen, Wei Sun, Zhihua Wang, Zhuohang Shi, Zhizun Luo, Hang Ouyang, Tianxin Xiao, Fan Yang, Zhaowang Wu, Kaixin Deng
Title: VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results
Abstract:
This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.
Chinese: ISRGC-Q挑战赛基于ISRGen-QA数据集,是ICCV 2025研讨会的一部分,重点评估由GAN和扩散模型等先进生成方法产生的超分辨率图像的感知质量及伪影,最终提交方案达到了最先进的性能水平。
English: The ISRGC-Q Challenge, based on the ISRGen-QA dataset and part of the ICCV 2025 Workshops, focuses on assessing perceptual quality and artifacts in super-resolution images produced by advanced generative models like GANs and diffusion models, with top submissions achieving state-of-the-art results.

Authors:Jeongmin Yu, Susang Kim, Kisu Lee, Taekyoung Kwon, Won-Yong Shin, Ha Young Kim
Title: Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing
Abstract:
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens and thus fail to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
中文: MVP-FAS框架通过多视角槽位注意力和多文本补丁对齐模块,充分利用CLIP的补丁嵌入和多样化文本提示,显著提升了人脸防伪的跨领域泛化性能。
English: The proposed MVP-FAS framework enhances face anti-spoofing by leveraging multi-view slot attention and multi-text patch alignment to better utilize CLIP's patch embeddings and multiple text prompts, achieving superior cross-domain generalization.
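A central idea above is aligning patch embeddings with multiple paraphrased prompts rather than a single prompt per class. A minimal sketch of per-patch scoring against paraphrase embeddings (CLIP-style cosine similarity; tensor shapes are assumptions):

import torch.nn.functional as F

def multi_text_patch_score(patch_emb, prompt_embs):
    # patch_emb: (P, D) patch embeddings; prompt_embs: (K, D) paraphrased prompts.
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(prompt_embs, dim=-1)
    sim = p @ t.T               # (P, K) patch-to-prompt cosine similarities
    return sim.mean(dim=1)      # per-patch score averaged over paraphrases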

Authors:Ruiming Du, Guangxun Zhai, Tian Qiu, Yu Jiang
Title: Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap
Abstract:
The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.
中文: 本综述通过评估深度学习方法、引入基准测试框架,并证明稀疏卷积网络和合成数据的有效性,解决了3D分割在植物表型分析中的局限性,从而弥合了算法进展与实际应用之间的差距。
English: This review tackles the limitations of 3D segmentation in plant phenotyping by evaluating deep learning methods, introducing a benchmarking framework, and demonstrating the effectiveness of sparse convolutional networks and synthetic data to bridge algorithmic advances with practical applications.

Authors:Jiangnan Xie, Xiaolong Zheng, Liang Zheng
Title: Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Abstract:
Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios containing both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components: First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.
中文:提出的原型感知多模态学习(PAML)框架通过增强跨模态对齐、特征融合和语义原型利用,克服了开放词汇视觉定位中的局限性,实现了最先进的性能。
English: The proposed Prototype-Aware Multimodal Learning (PAML) framework overcomes limitations in open-vocabulary visual grounding by enhancing cross-modal alignment, feature fusion, and semantic prototype utilization, achieving state-of-the-art performance.

Authors:Lucas Wojcik, Luiz Coelho, Roger Granada, David Menotti
Title: Exploring Light-Weight Object Recognition for Real-Time Document Detection
Abstract:
Object Recognition and Document Skew Estimation have come a long way in terms of performance and efficiency. New models follow one of two directions: improving performance using larger models, and improving efficiency using smaller models. However, real-time document detection and rectification is a niche that is largely unexplored by the literature, yet it remains a vital step for automatic information retrieval from visual documents. In this work, we strive towards an efficient document detection pipeline that is satisfactory in terms of Optical Character Recognition (OCR) retrieval and faster than other available solutions. We adapt IWPOD-Net, a license plate detection network, and train it for detection on NBID, a synthetic ID card dataset. We experiment with data augmentation and cross-dataset validation with MIDV (another synthetic ID and passport document dataset) to find the optimal scenario for the model. Other methods from both the Object Recognition and Skew Estimation state-of-the-art are evaluated for comparison with our approach. We use each method to detect and rectify the document, which is then read by an OCR system. The OCR output is then evaluated using a novel OCR quality metric based on the Levenshtein distance. Since the end goal is to improve automatic information retrieval, we use the overall OCR quality as a performance metric. We observe that with a promising model, document rectification does not have to be perfect to attain state-of-the-art performance scores. We show that our model is smaller and more efficient than current state-of-the-art solutions while retaining a competitive OCR quality metric. All code is available at https://github.com/BOVIFOCR/iwpod-doc-corners.git
中文: 本研究基于车牌检测网络开发了一种高效的文档检测与校正流程,在保持竞争力的OCR质量指标的同时,实现了比现有最优方案更轻量、更快速的性能表现。
English: This research introduces an efficient document detection and rectification pipeline adapted from a license plate detection network, achieving competitive OCR performance with a smaller, faster model than current state-of-the-art methods.
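The OCR quality metric above is based on the Levenshtein distance; its exact formula is not given in the abstract. One common normalization, shown here purely as an illustration, maps edit distance to a similarity in [0, 1]:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between strings a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ocr_quality(ocr_text, ground_truth):
    # Normalized similarity: 1.0 means a perfect transcript.
    d = levenshtein(ocr_text, ground_truth)
    return 1.0 - d / max(len(ocr_text), len(ground_truth), 1)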

Authors:Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu
Title: UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Abstract:
We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.

Authors:Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
Title: Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
Abstract:
Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop \& Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
中文摘要:作者提出了一种字幕辅助推理框架来增强多模态推理能力,该方法在ICML 2025 SeePhys挑战赛中荣获第一,并在MathVerse基准测试中展现出优秀的泛化性能。
English Summary: The authors propose a caption-assisted reasoning framework to enhance multimodal reasoning, achieving top performance in the ICML 2025 SeePhys challenge and demonstrating strong generalization on the MathVerse benchmark.

Authors:Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
Title: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Abstract:
Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, and suffer from unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO, while reducing per-iteration training time by nearly 55%. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Code is available at https://fredreic1849.github.io/BranchGRPO-Webpage/.
中文: BranchGRPO通过将生成模型的展开过程重构为带共享前缀和剪枝的分支树结构,将每次迭代训练时间缩短近55%,对齐分数最高提升16%,优于现有方法。
English: BranchGRPO enhances generative model alignment by restructuring rollouts into a branching tree with shared prefixes and pruning, cutting per-iteration training time by nearly 55% and improving alignment scores by up to 16% over prior methods.
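The reward fusion and depth-wise advantage estimator converts sparse terminal rewards into dense step-level signals. The sketch below is one plausible reading (mean fusion bottom-up over the rollout tree, then per-depth normalization); the paper's exact estimator may differ:

import numpy as np

def depthwise_advantages(children, rewards, root="root"):
    # children: dict node -> list of child nodes; rewards: dict leaf -> scalar.
    value = dict(rewards)
    def fuse(node):                          # propagate rewards toward the root
        kids = children.get(node, [])
        if kids:
            value[node] = float(np.mean([fuse(k) for k in kids]))
        return value[node]
    fuse(root)
    by_depth = {}
    def walk(node, d=0):                     # group nodes by tree depth
        by_depth.setdefault(d, []).append(node)
        for k in children.get(node, []):
            walk(k, d + 1)
    walk(root)
    adv = {}
    for nodes in by_depth.values():          # normalize within each depth
        vals = np.array([value[n] for n in nodes])
        mu, sd = vals.mean(), vals.std() + 1e-8
        for n, v in zip(nodes, vals):
            adv[n] = (v - mu) / sd           # dense step-level advantage
    return adv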

Authors:Zhiwen Shao, Yifan Cheng, Fan Zhang, Xuehuai Shi, Canlin Li, Lizhuang Ma, Dit-yan Yeung
Title: Micro-Expression Recognition via Fine-Grained Dynamic Perception
Abstract:
Facial micro-expression recognition (MER) is a challenging task, due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, in which the former often additionally requires key frames, and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank frame-level features of a sequence of raw frames in chronological order, in which the rank process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is further adopted to calculate rank scores of each frame-level feature. Afterwards, the rank features from rank scorer are pooled in temporal dimension to capture dynamic representation. Finally, the dynamic representation is shared by a MER module and a dynamic image construction module, in which the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The design of dynamic image construction task is beneficial for capturing facial subtle actions associated with MEs and alleviating the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms the state-of-the-art MER methods, and (ii) works well for dynamic image construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11% over the previous best results in terms of F1-score on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. The code is available at https://github.com/CYF-cuber/FDP.
中文: 本文提出了一种细粒度动态感知框架,通过排序帧级特征并利用变换器捕捉动态表示,显著提升了面部微表情识别的性能,在多个数据集上取得了最优结果。
English: This paper introduces a fine-grained dynamic perception framework that enhances facial micro-expression recognition by ranking frame-level features and using a transformer for dynamic representation, achieving state-of-the-art results across multiple datasets.

Authors:Wanyin Cheng, Zanxi Ruan
Title: BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users
Abstract:
Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions-posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.
中文摘要:BLaVe-CoT是一种新型视觉问答框架,通过生成多样化答案、空间定位及思维链推理来应对盲人和低视力用户的模糊提问,在基准测试中优于现有方法,提升了辅助技术的包容性。
English Summary: BLaVe-CoT is a novel VQA framework that addresses ambiguity in questions from blind and low vision users by generating diverse answers, grounding them spatially, and using chain-of-thought reasoning to evaluate answer consistency, outperforming previous methods on benchmarks.

Authors:Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel
Title: Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance
Abstract:
Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the success of these models is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained models do not exist for 3D, significantly limiting progress. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language remains unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression, and enhanced medical training by visualizing hypothetical conditions in realistic detail. Our work takes a step toward this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this is the first demonstration of a language-guided native-3D diffusion model applied to neurological imaging, where faithful three-dimensional modeling is essential. On two neurological MRI datasets, our framework simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity. Our results lay the groundwork for prompt-driven disease progression analysis in 3D medical imaging. Project link - https://lesupermomo.github.io/imagining-alternatives/.

Authors:Ye Wang, Zili Yi, Yibo Zhang, Peng Zheng, Xuping Xie, Jiang Lin, Yilin Wang, Rui Ma
Title: OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization
Abstract:
OmniStyle2 introduces a novel approach to artistic style transfer by reframing it as a data problem. Our key insight is destylization, reversing style transfer by removing stylistic elements from artworks to recover natural, style-free counterparts. This yields DST-100K, a large-scale dataset that provides authentic supervision signals by aligning real artistic styles with their underlying content. To build DST-100K, we develop (1) DST, a text-guided destylization model that reconstructs style-free content, and (2) DST-Filter, a multi-stage evaluation model that employs Chain-of-Thought reasoning to automatically discard low-quality pairs while ensuring content fidelity and style accuracy. Leveraging DST-100K, we train OmniStyle2, a simple feed-forward model based on FLUX.1-dev. Despite its simplicity, OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks. Our results demonstrate that scalable data generation via destylization provides a reliable supervision paradigm, overcoming the fundamental challenge posed by the lack of ground-truth data in artistic style transfer.

Authors:Jeonghyun Noh, Wangsu Jeon, Jinsun Park
Title: Dual Interaction Network with Cross-Image Attention for Medical Image Segmentation
Abstract:
Medical image segmentation is a crucial method for assisting professionals in diagnosing various diseases through medical imaging. However, various factors such as noise, blurriness, and low contrast often hinder the accurate diagnosis of diseases. While numerous image enhancement techniques can mitigate these issues, they may also alter crucial information needed for accurate diagnosis in the original image. Conventional image fusion strategies, such as feature concatenation, can address this challenge. However, they struggle to fully leverage the advantages of both original and enhanced images while suppressing the side effects of the enhancements. To overcome this problem, we propose a dual interactive fusion module (DIFM) that effectively exploits mutually complementary information from the original and enhanced images. DIFM employs cross-attention bidirectionally to simultaneously attend to corresponding spatial information across the two images, subsequently refining the complementary features via global spatial attention. This interaction leverages low- to high-level features implicitly associated with diverse structural attributes like edges, blobs, and object shapes, resulting in enhanced features that embody important spatial characteristics. In addition, we introduce a multi-scale boundary loss based on gradient extraction to improve segmentation accuracy at object boundaries. Experimental results on the ACDC and Synapse datasets demonstrate the superiority of the proposed method quantitatively and qualitatively. Code available at: https://github.com/JJeong-Gari/DIN
中文: 提出的双重交互融合模块(DIFM)通过交叉注意力和全局空间注意力有效整合原始与增强医学图像的互补信息,结合多尺度边界损失提升物体边界分割精度,在基准数据集上展现出优越性能。
English: The proposed dual interactive fusion module (DIFM) effectively integrates complementary information from original and enhanced medical images using cross-attention and global spatial attention, while a multi-scale boundary loss improves segmentation accuracy at object boundaries, demonstrating superior performance on benchmark datasets.
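The dual interactive fusion applies cross-attention bidirectionally between the original and enhanced image features. A condensed PyTorch sketch of that exchange (dimensions and the residual form are assumptions, not the published module):

import torch.nn as nn

class DualInteraction(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.o2e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.e2o = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_orig, f_enh):          # (B, HW, dim) flattened feature maps
        o, _ = self.o2e(f_orig, f_enh, f_enh)  # original queries the enhanced image
        e, _ = self.e2o(f_enh, f_orig, f_orig) # enhanced queries the original image
        return f_orig + o, f_enh + e           # residual, mutually refined features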

Authors:Feng Wang, Zihao Yu
Title: Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Abstract:
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step in applying online RL methods to Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized via a Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback of this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
中文摘要:强化学习虽能提升流匹配模型的图像生成质量,但随机采样会引入噪声伪影;我们提出的系数保持采样方法可消除此类噪声,从而改进奖励建模并加速训练收敛。
English Summary: Reinforcement Learning enhances image generation in Flow Matching models but introduces noise artifacts through stochastic sampling, which our proposed Coefficients-Preserving Sampling method eliminates to improve reward modeling and training convergence.
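For reference, the DDIM-style update that CPS draws inspiration from is shown below; it is fully deterministic at eta = 0 and reintroduces controlled stochasticity otherwise (alpha denotes the cumulative alpha-bar schedule; the CPS reformulation itself is given in the paper):

import torch

@torch.no_grad()
def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, eta=0.0):
    # alpha_t, alpha_prev: scalar tensors of cumulative alphas at t and t-1.
    x0_pred = (x_t - (1 - alpha_t).sqrt() * eps_pred) / alpha_t.sqrt()
    sigma = eta * (((1 - alpha_prev) / (1 - alpha_t))
                   * (1 - alpha_t / alpha_prev)).sqrt()
    dir_xt = (1 - alpha_prev - sigma ** 2).sqrt() * eps_pred  # direction to x_t
    noise = sigma * torch.randn_like(x_t)                     # zero when eta = 0
    return alpha_prev.sqrt() * x0_pred + dir_xt + noise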

Authors:Shuolong Chen, Xingxing Li, Liu Yuan
Title: eKalibr-Inertial: Continuous-Time Spatiotemporal Calibration for Event-Based Visual-Inertial Systems
Abstract:
The bioinspired event camera, distinguished by its exceptional temporal resolution, high dynamic range, and low power consumption, has been extensively studied in recent years for motion estimation, robotic perception, and object detection. In ego-motion estimation, the visual-inertial setup is commonly adopted due to the complementary characteristics of the sensors (e.g., scale perception and low drift). For optimal event-based visual-inertial fusion, accurate spatiotemporal (extrinsic and temporal) calibration is required. In this work, we present eKalibr-Inertial, an accurate spatiotemporal calibrator for event-based visual-inertial systems, utilizing the widely used circle grid board. Building upon the grid pattern recognition and tracking methods in eKalibr and eKalibr-Stereo, the proposed method starts with a rigorous and efficient initialization, in which all parameters in the estimator are accurately recovered. Subsequently, a continuous-time batch optimization is conducted to refine the initialized parameters toward better states. The results of extensive real-world experiments show that eKalibr-Inertial can achieve accurate event-based visual-inertial spatiotemporal calibration. The implementation of eKalibr-Inertial is open-sourced at https://github.com/Unsigned-Long/eKalibr to benefit the research community.
Chinese: 本文提出eKalibr-Inertial,一种基于事件相机的视觉-惯性系统开源时空标定方法,通过严格初始化和连续时间批量优化实现精确参数估计。
English: This paper introduces eKalibr-Inertial, an open-source spatiotemporal calibration method for event-based visual-inertial systems that achieves accurate parameter estimation through rigorous initialization and continuous-time batch optimization.

Authors:Tyler Ward, Abdullah Imran
Title: A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation
Abstract:
Recent advances in promptable segmentation, such as the Segment Anything Model (SAM), have enabled flexible, high-quality mask generation across a wide range of visual domains. However, SAM and similar models remain fundamentally deterministic, producing a single segmentation per object per prompt, and fail to capture the inherent ambiguity present in many real-world tasks. This limitation is particularly troublesome in medical imaging, where multiple plausible segmentations may exist due to annotation uncertainty or inter-expert variability. In this paper, we introduce Probabilistic SAM, a probabilistic extension of SAM that models a distribution over segmentations conditioned on both the input image and prompt. By incorporating a latent variable space and training with a variational objective, our model learns to generate diverse and plausible segmentation masks reflecting the variability in human annotations. The architecture integrates a prior and posterior network into the SAM framework, allowing latent codes to modulate the prompt embeddings during inference. The latent space allows for efficient sampling during inference, enabling uncertainty-aware outputs with minimal overhead. We evaluate Probabilistic SAM on the public LIDC-IDRI lung nodule dataset and demonstrate its ability to produce diverse outputs that align with expert disagreement, outperforming existing probabilistic baselines on uncertainty-aware metrics. Our code is available at: https://github.com/tbwa233/Probabilistic-SAM/.
中文: 针对SAM模型确定性分割的局限,本文提出概率SAM,通过引入潜在变量和变分训练生成多样化分割掩码,有效反映医学图像中专家标注的不确定性,并在公开数据集上验证了其优越性。
English: The Segment Anything Model (SAM) is deterministic and fails to capture segmentation ambiguity, so Probabilistic SAM is introduced as an extension that models a distribution over segmentations to generate diverse, uncertainty-aware masks, particularly benefiting medical imaging.
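Variational training over a latent space typically pairs a reconstruction term with a KL divergence between posterior and prior networks, as in the Probabilistic U-Net lineage this work extends. A generic sketch of that objective for diagonal Gaussians (not the paper's exact formulation):

import torch
import torch.nn.functional as F

def variational_seg_loss(logits, target, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    # logits: mask logits decoded from a posterior sample; target: binary mask.
    recon = F.binary_cross_entropy_with_logits(logits, target)
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1).sum(dim=-1).mean()      # KL(posterior || prior)
    return recon + beta * kl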

Authors:Zijian Chen, Wenjie Hua, Jinhao Li, Lirong Deng, Fan Du, Tingzhu Chen, Guangtao Zhai
Title: PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters
Abstract:
Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity's early modes of production. Current decipherment methodologies for OBCs are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment of pictographic OBCs. It includes 20k meticulously collected OBC and real-object images, forming over 15k multiple-choice questions. We also conduct subjective annotations to investigate the consistency of reference points between humans and LMMs in visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills but do not use visual information effectively, being limited most of the time by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.
中文摘要:本文提出了PictOBI-20k数据集,用于评估大型多模态模型对甲骨文的视觉解读能力,发现现有模型具备初步解读技能但主要受限于语言先验而非视觉信息。
English Summary: This paper introduces PictOBI-20k, a dataset for evaluating large multimodal models' ability to visually decipher oracle bone characters, revealing that current models possess basic skills but rely more on language priors than visual information.

Authors:Leo Ho, Yinghao Huang, Dafei Qin, Mingyi Shi, Wangpok Tse, Wei Liu, Junichi Yamagishi, Taku Komura
Title: InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios
Abstract:
We address the problem of accurately capturing interactive behaviors between two people in daily scenarios. Most previous works either consider only one person or focus solely on the conversational gestures of two people, assuming the body orientation and/or position of each actor is constant or barely changes over each interaction. In contrast, we propose to simultaneously model two people's activities, and target objective-driven, dynamic, and semantically consistent interactions which often span longer durations and cover larger spaces. To this end, we capture a new multi-modal dataset dubbed InterAct, which is composed of 241 motion sequences where two people perform a realistic and coherent scenario for one minute or longer over a complete interaction. For each sequence, two actors are assigned different roles and emotion labels, and collaborate to finish one task or conduct a common interaction activity. The audio, body motions, and facial expressions of both persons are captured. InterAct contains diverse and complex motions of individuals, and interesting and relatively long-term interaction patterns barely seen before. We also demonstrate a simple yet effective diffusion-based method that estimates the interactive facial expressions and body motions of two people from speech inputs. Our method regresses the body motions in a hierarchical manner, and we also propose a novel fine-tuning mechanism to improve the lip accuracy of facial expressions. To facilitate further research, the data and code are made available at https://hku-cg.github.io/interact/.

Authors:Gašper Podobnik, Tomaž Vrtovec
Title: MeshMetrics: A Precise Implementation of Distance-Based Image Segmentation Metrics
Abstract:
The surge of research in image segmentation has yielded remarkable performance gains but also exposed a reproducibility crisis. A major contributor is performance evaluation, where both the selection and the implementation of metrics play critical roles. While recent efforts have improved the former, the reliability of metric implementation has received far less attention. Pitfalls in distance-based metric implementation can lead to considerable discrepancies between common open-source tools, for instance, exceeding 100 mm for the Hausdorff distance and 30 percentage points for the normalized surface distance for the same pair of segmentations. To address these pitfalls, we introduce MeshMetrics, a mesh-based framework that provides a more precise computation of distance-based metrics than conventional grid-based approaches. Through theoretical analysis and empirical validation, we demonstrate that MeshMetrics achieves higher accuracy and precision than established tools, and is substantially less affected by discretization artifacts, such as distance quantization. We release MeshMetrics as an open-source Python package, available at https://github.com/gasperpodobnik/MeshMetrics.
中文: 摘要指出图像分割领域因度量实现不可靠而面临可重复性危机,并提出了MeshMetrics这一基于三角网格(mesh)的框架,相比传统基于体素栅格(grid)的方法能更精确地计算基于距离的度量。
English: The abstract highlights a reproducibility crisis in image segmentation due to unreliable metric implementations and introduces MeshMetrics, a mesh-based framework that ensures more accurate computation of distance-based metrics than traditional grid-based methods.
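To make the discretization pitfall concrete: grid-based tools typically compute the Hausdorff distance between point sets sampled from voxel boundaries, so results are quantized by the voxel spacing. A minimal grid-style computation with SciPy is shown below for contrast (MeshMetrics instead evaluates distances on surface meshes):

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(points_a, points_b):
    # points_a, points_b: (N, 3) and (M, 3) boundary points of two segmentations.
    d_ab = directed_hausdorff(points_a, points_b)[0]
    d_ba = directed_hausdorff(points_b, points_a)[0]
    return max(d_ab, d_ba)   # symmetric Hausdorff distance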

Authors:Xiaomeng Zhu, Changwei Wang, Haozhe Wang, Xinyu Liu, Fangzhen Lin
Title: OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation
Abstract:
A scene graph is a structured representation of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications such as intelligent surveillance and human-machine collaboration. Existing SGA approaches primarily leverage visual cues and often struggle to integrate valuable commonsense knowledge, limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task into two steps: first, a scene graph capturing model converts a video clip into a sequence of scene graphs; then, a pure text-based model predicts scene graphs in future frames. Our focus in this work is on the second step, which we call Linguistic Scene Graph Anticipation (LSGA) and believe has independent interest beyond its use in SGA here. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where a Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at https://github.com/ZhuXMMM/OOTSM.
中文: 本文提出了一种语言驱动的场景图预测方法,将任务解耦为视觉场景图提取和基于大语言模型的文本预测两个阶段,在长期预测方面实现了显著性能提升。
English: This paper introduces a linguistic approach to scene graph anticipation that decouples the task into visual scene graph extraction followed by text-based future prediction using large language models, achieving significant performance improvements particularly in long-term forecasting.

Authors:Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Title: Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
Abstract:
Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at https://github.com/park-jungin/videograph.
Chinese: 本文提出VideoGraph方法,将视频摘要视为语言引导的时空图建模问题,通过递归图网络整合对象与帧之间的语义关系,在多项基准测试中实现了最先进的性能。
English: This paper introduces VideoGraph, a novel approach that treats video summarization as a language-guided spatiotemporal graph modeling problem, achieving state-of-the-art performance by incorporating semantic relationships between objects and frames through recursive graph networks.

Authors:Changtao Miao, Yi Zhang, Man Luo, Weiwei Feng, Kaiyuan Zheng, Qi Chu, Tao Gong, Jianshu Li, Yunfeng Diao, Wei Zhou, Joey Tianyi Zhou, Xiaoshuai Hao
Title: MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios
Abstract:
Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to societal security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary for real-world scenarios. Specifically, these datasets fall short in four key areas: coverage of advanced forgery techniques, variability of facial scenes, richness of real data, and degradation from real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (\textbf{MFFI}) dataset, tailored for real-world scenarios. MFFI enhances realism along four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates $50$ different forgery methods and contains $1024K$ image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at {https://github.com/inclusionConf/MFFI}.
中文:MFFI数据集通过融合多种伪造方法、多样化场景、真实数据及现实退化处理,弥补了现有Deepfake检测数据集的不足,显著提升了真实场景下的检测性能。
English: The MFFI dataset addresses limitations in current Deepfake detection by incorporating diverse forgery methods, varied scenes, authentic data, and real-world degradation, enhancing detection capabilities for real-world scenarios.

Authors:Ashen Rodrigo, Isuru Munasinghe, Asanka Perera
Title: Vision-Based Object Detection for UAV Solar Panel Inspection Using an Enhanced Defects Dataset
Abstract:
Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. This study presents a comprehensive evaluation of five state-of-the-art object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer, for identifying physical and electrical defects as well as surface contaminants such as dust, dirt, and bird droppings on solar panels. A custom dataset, annotated in the COCO format and specifically designed for solar panel defect and contamination detection, was developed alongside a user interface to train and evaluate the models. The performance of each model is assessed and compared based on mean Average Precision (mAP), precision, recall, and inference speed. The results demonstrate the trade-offs between detection accuracy and computational efficiency, highlighting the relative strengths and limitations of each model. These findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios. The dataset will be publicly available at https://github.com/IsuruMunasinghe98/solar-panel-inspection-dataset.
Chinese: 本研究评估了五种先进目标检测模型在识别太阳能电池板缺陷与污染物方面的性能,通过对比精度与效率为实际监测应用提供指导。
English: This study evaluates five advanced object detection models for identifying defects and contaminants on solar panels, comparing their accuracy and efficiency to guide practical monitoring applications.

Authors:Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang
Title: Delta Velocity Rectified Flow for Text-to-Image Editing
Abstract:
We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. Code is available at https://github.com/Harvard-AI-and-Robotics-Lab/DeltaVelocityRectifiedFlow.
中文: 我们提出了Delta Velocity Rectified Flow (DVRF),这是一种无反转的编辑框架,通过建模速度场差异并引入时间相关偏移来提升文本到图像编辑的质量和对齐度,无需修改模型架构。
English: We introduce Delta Velocity Rectified Flow (DVRF), an inversion-free framework that models velocity field discrepancies and incorporates a time-dependent shift to enhance text-to-image editing quality and alignment without architectural changes.

Authors:Matteo Poggi, Fabio Tosi
Title: FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
Abstract:
We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances in the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower than most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art SEA-RAFT, as well as on the Spring and LayeredFlow datasets.
Chinese: FlowSeek是一种高效的光流框架,仅需单个消费级GPU即可完成训练,在多个数据集上实现卓越的跨数据集泛化能力,性能比先前最优方法提升10-15%。
English: FlowSeek is a highly efficient optical flow framework that achieves superior cross-dataset generalization with minimal hardware requirements, training on a single consumer-grade GPU while outperforming previous state-of-the-art methods by 10-15%.

Authors:Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, Tong He
Title: WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
Abstract:
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
中文: WinT3R是一种前馈重建模型,通过滑动窗口机制和紧凑相机表示,在在线重建质量、相机姿态估计和速度方面均达到领先水平。
English: WinT3R is a feed-forward reconstruction model that achieves state-of-the-art online reconstruction quality, camera pose estimation, and speed through a sliding window mechanism and compact camera representation.

Authors:Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari
Title: Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet
Abstract:
The classification of 3D point clouds is crucial for applications such as autonomous driving, robotics, and augmented reality. However, the commonly used ModelNet40 dataset suffers from limitations such as inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. This paper introduces ModelNet-R, a meticulously refined version of ModelNet40 designed to address these issues and serve as a more reliable benchmark. Additionally, this paper proposes Point-SkipNet, a lightweight graph-based neural network that leverages efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Extensive experiments demonstrate that models trained on ModelNet-R exhibit significant performance improvements. Notably, Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count compared to contemporary models. This research highlights the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification. For more details, see the code at: https://github.com/m-saeid/ModeNetR_PointSkipNet.
中文: 本文提出了改进的3D点云数据集ModelNet-R以解决ModelNet40的缺陷,并设计了轻量级神经网络Point-SkipNet,该网络以更少参数实现最优分类精度,凸显了数据集质量对模型效能的关键作用。
English: This paper introduces ModelNet-R, an improved 3D point cloud dataset addressing ModelNet40's limitations, and proposes Point-SkipNet, a lightweight neural network that achieves top accuracy with fewer parameters, emphasizing dataset quality's role in model efficiency.
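The abstract names efficient sampling, neighborhood grouping, and skip connections as Point-SkipNet's ingredients. The block below is a generic PyTorch sketch of neighborhood grouping plus a skip connection around a small MLP (sampling is omitted), intended only to make those terms concrete; it is not the Point-SkipNet architecture, and the neighborhood size, layer widths, and max-pooling choice are assumptions.

```python
# Generic sketch of kNN grouping + a skip connection for point-cloud features.
# NOT the Point-SkipNet architecture.
import torch
import torch.nn as nn

class GroupedBlock(nn.Module):
    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))
        self.skip = nn.Linear(in_dim, out_dim)  # skip connection around the grouped MLP

    def forward(self, xyz, feats):
        # xyz: (B, N, 3), feats: (B, N, C). Group each point with its k nearest neighbours.
        dist = torch.cdist(xyz, xyz)                          # (B, N, N)
        idx = dist.topk(self.k, largest=False).indices        # (B, N, k) neighbour indices
        grouped = torch.gather(
            feats.unsqueeze(1).expand(-1, feats.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, feats.size(-1)))   # (B, N, k, C)
        pooled = grouped.max(dim=2).values                    # max-pool over each neighbourhood
        return self.mlp(pooled) + self.skip(feats)            # residual/skip connection

xyz, feats = torch.rand(2, 128, 3), torch.rand(2, 128, 32)
print(GroupedBlock(32, 64)(xyz, feats).shape)                 # torch.Size([2, 128, 64])
```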

Authors:Julia Dietlmeier, Oluwabukola Grace Adegboro, Vayangi Ganepola, Claudia Mazo, Noel E. O'Connor
Title: VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation
Abstract:
Vision-language models and their adaptations to image segmentation tasks present enormous potential for producing highly accurate and interpretable results. However, implementations based on CLIP and BiomedCLIP are still lagging behind more sophisticated architectures such as CRIS. In this work, instead of focusing on text prompt engineering as is the norm, we attempt to narrow this gap by showing how to ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN. By doing so, we achieve a significant Dice score improvement of 6.3% on the BKAI polyp dataset using the ensembled BiomedCLIPSeg, while other datasets exhibit gains ranging from 1% to 6%. Furthermore, we provide initial results on four additional radiology and non-radiology datasets. We conclude that ensembling works differently across these datasets (from outperforming to underperforming the CRIS model), indicating a topic for future investigation by the community. The code is available at https://github.com/juliadietlmeier/VLSM-Ensemble.
中文: 本研究通过将视觉语言分割模型与低复杂度CNN集成,显著提升了医学图像分割性能,在部分数据集上Dice分数最高提升6.3%,但不同数据集表现存在差异,为未来研究提供了新方向。
English: This research demonstrates that ensembling vision-language segmentation models with a low-complexity CNN significantly improves performance, achieving up to a 6.3% Dice score increase on medical datasets, though results vary compared to sophisticated architectures like CRIS.
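A minimal sketch of the ensembling idea: average the probability maps of a VLSM and a low-complexity CNN before thresholding, then score with Dice. The equal-weight fusion rule and the synthetic data below are assumptions; the paper's exact ensembling scheme may differ.

```python
# Minimal sketch of ensembling two segmentation probability maps and scoring with Dice.
import numpy as np

def dice(pred, target, eps=1e-7):
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def ensemble(prob_vlsm, prob_cnn, weight=0.5, threshold=0.5):
    fused = weight * prob_vlsm + (1.0 - weight) * prob_cnn   # assumed equal-weight fusion
    return fused > threshold

rng = np.random.default_rng(0)
gt = rng.random((256, 256)) > 0.7
p1 = np.clip(gt + rng.normal(0, 0.4, gt.shape), 0, 1)        # noisy "VLSM" probabilities
p2 = np.clip(gt + rng.normal(0, 0.4, gt.shape), 0, 1)        # noisy "CNN" probabilities
print(dice(p1 > 0.5, gt), dice(ensemble(p1, p2), gt))        # the ensemble usually scores higher
```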

Authors:Yanzhi Tian, Zeming Liu, Zhengyang Liu, Chong Feng, Xin Li, Heyan Huang, Yuhang Guo
Title: PRIM: Towards Practical In-Image Multilingual Machine Translation
Abstract:
In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose an end-to-end model, VisTrans, to handle the challenges of practical conditions in PRIM: it processes the visual text and background information in the image separately, ensuring multilingual translation capability while improving visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effects compared to other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.
中文: 本研究提出了PRIM数据集和端到端的VisTrans模型,以解决实际场景中的图像内多语言机器翻译问题,相比其他模型在翻译质量和视觉效果上表现更优。
English: This study introduces the PRIM dataset and an end-to-end model called VisTrans to address practical in-image multilingual machine translation challenges, achieving superior translation quality and visual effects compared to existing models.

Authors:Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner
Title: Robust Experts: the Effect of Adversarial Training on CNNs with Sparse Mixture-of-Experts Layers
Abstract:
Robustifying convolutional neural networks (CNNs) against adversarial attacks remains challenging and often requires resource-intensive countermeasures. We explore the use of sparse mixture-of-experts (MoE) layers to improve robustness by replacing selected residual blocks or convolutional layers, thereby increasing model capacity without additional inference cost. On ResNet architectures trained on CIFAR-100, we find that inserting a single MoE layer in the deeper stages leads to consistent improvements in robustness under PGD and AutoPGD attacks when combined with adversarial training. Furthermore, we discover that when switch loss is used for balancing, it causes routing to collapse onto a small set of overused experts, thereby concentrating adversarial training on these paths and inadvertently making them more robust. As a result, some individual experts outperform the gated MoE model in robustness, suggesting that robust subpaths emerge through specialization. Our code is available at https://github.com/KASTEL-MobilityLab/robust-sparse-moes.
中文: 该研究表明,在卷积神经网络中引入稀疏专家混合层能通过增加模型容量而不提升推理成本来增强对抗攻击的鲁棒性,同时发现当路由集中于少数过度使用的专家时,会形成专门化的鲁棒子路径。
English: This study demonstrates that integrating sparse mixture-of-experts layers into CNNs enhances robustness against adversarial attacks by increasing model capacity without extra inference costs, while also revealing that specialized robust subpaths emerge when routing collapses onto overused experts.
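To make the "sparse MoE layer replacing a convolutional block" idea concrete, here is a simplified PyTorch sketch with top-1 routing over convolutional experts; the gate design, expert count, and per-image routing granularity are assumptions rather than the authors' exact module. The returned gate logits are what a balancing objective such as the switch loss mentioned in the abstract would act on.

```python
# Simplified sparse mixture-of-experts layer with top-1 routing over conv experts.
import torch
import torch.nn as nn

class SparseMoEConv(nn.Module):
    def __init__(self, channels, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_experts)])
        self.gate = nn.Linear(channels, num_experts)          # routing network

    def forward(self, x):
        # Route each image to one expert based on globally pooled features.
        logits = self.gate(x.mean(dim=(2, 3)))                # (B, num_experts)
        top1 = logits.argmax(dim=-1)                          # hard, sparse routing
        out = torch.stack([self.experts[e](x[i:i + 1])
                           for i, e in enumerate(top1.tolist())])
        return out.squeeze(1), logits                         # logits feed a balancing loss

x = torch.rand(8, 16, 32, 32)
y, gate_logits = SparseMoEConv(16)(x)
print(y.shape)                                                # torch.Size([8, 16, 32, 32])
```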

Authors:Luca Müller, Hassan Ali, Philipp Allgeuer, Lukáš Gajdošech, Stefan Wermter
Title: Pointing-Guided Target Estimation via Transformer-Based Attention
Abstract:
Deictic gestures, like pointing, are a fundamental form of non-verbal communication, enabling humans to direct attention to specific objects or locations. This capability is essential in Human-Robot Interaction (HRI), where robots should be able to predict human intent and anticipate appropriate responses. In this work, we propose the Multi-Modality Inter-TransFormer (MM-ITF), a modular architecture to predict objects in a controlled tabletop scenario with the NICOL robot, where humans indicate targets through natural pointing gestures. Leveraging inter-modality attention, MM-ITF maps 2D pointing gestures to object locations, assigns a likelihood score to each, and identifies the most likely target. Our results demonstrate that the method can accurately predict the intended object using monocular RGB data, thus enabling intuitive and accessible human-robot collaboration. To evaluate the performance, we introduce a patch confusion matrix, providing insights into the model's predictions across candidate object locations. Code available at: https://github.com/lucamuellercode/MMITF.
Chinese Summary: 本研究提出MM-ITF模块化架构,通过跨模态注意力机制将二维指向手势映射至目标物体位置,利用单目RGB数据实现精准意图识别,并引入区块混淆矩阵评估模型性能,为人机协作提供直观交互方案。
English Summary: The study introduces MM-ITF, a modular architecture that accurately predicts target objects from human pointing gestures using monocular RGB data, enhancing intuitive human-robot collaboration through inter-modality attention and a novel evaluation metric.

Authors:Hulin Li, Qiliang Ren, Jun Li, Hanbing Wei, Zheng Liu, Linfang Fan
Title: A biologically inspired separable learning vision model for real-time traffic object perception in Dark
Abstract:
Fast and accurate object perception in low-light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to, and accurately predict in, low-light environments. Moreover, no large-scale benchmark specifically focused on low-light traffic scenes has been available. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real-world low-light settings and construct Dark-traffic, the largest densely annotated dataset to date for low-light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. SLVM integrates four key components: a light-adaptive pupillary mechanism for illumination-sensitive feature extraction, a feature-level separable learning strategy for efficient representation, task-specific decoupled branches for multi-task separable learning, and a spatial misalignment-aware fusion module for precise multi-feature alignment. Extensive experiments demonstrate that SLVM achieves state-of-the-art performance with reduced computational overhead. Notably, it outperforms RT-DETR by 11.2 percentage points in detection, YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the endpoint error (EPE) of the baseline by 12.37% on Dark-traffic. On the LIS benchmark, the end-to-end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask R-CNN (with light enhancement) by 3.1 percentage points. The Dark-traffic dataset and complete code are released at https://github.com/alanli1997/slvm.
中文: 本文提出了针对低光照交通场景的最大标注数据集Dark-traffic,并设计了仿生可分离学习视觉模型SLVM,该模型以更低计算成本在目标检测、实例分割和光流估计任务中实现了最优性能。
English: This paper introduces Dark-traffic, the largest annotated dataset for low-light traffic scenes, and proposes the Separable Learning Vision Model (SLVM), a biologically inspired framework that achieves state-of-the-art performance in object detection, instance segmentation, and optical flow estimation with reduced computational costs.

Authors:Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao
Title: SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
Abstract:
The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
中文摘要:现有GUI感知多模态大语言模型因离散坐标建模和有限元素检测存在精度与速度问题,SparkUI-Parser通过连续坐标建模和增强解析能力,在多个基准测试中实现更优性能。
English Summary: Existing multimodal large language models for GUI perception face challenges in accuracy and speed due to discrete coordinate modeling and limited element detection, which SparkUI-Parser addresses through continuous coordinate modeling and enhanced parsing capabilities to achieve superior performance across multiple benchmarks.
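The rejection mechanism is described as a modified Hungarian matching; the sketch below shows plain Hungarian matching with a cost threshold for rejecting poor matches, which captures the spirit but not the authors' modification. The cost matrix and threshold are illustrative.

```python
# Hungarian matching of predicted GUI elements to references, with simple rejection
# of high-cost assignments (a sketch, not SparkUI-Parser's modified algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_rejection(cost, reject_above=0.5):
    rows, cols = linear_sum_assignment(cost)            # optimal one-to-one assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= reject_above]
    rejected = [int(r) for r, c in zip(rows, cols) if cost[r, c] > reject_above]
    return matches, rejected

cost = np.array([[0.10, 0.90, 0.80],
                 [0.70, 0.20, 0.90],
                 [0.90, 0.80, 0.95]])                   # e.g. 1 - IoU between elements
print(match_with_rejection(cost))                       # third query has no good match
```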

Authors:Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang
Title: PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Abstract:
Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.
中文摘要:PropVG提出了一种端到端的基于提议的框架,通过对比学习和多粒度判别机制解决视觉定位中的现有缺陷,在多个基准测试中实现了卓越性能。
English Summary: PropVG introduces an end-to-end proposal-based framework with contrastive learning and multi-granularity discrimination to address limitations in visual grounding, achieving superior performance across multiple benchmarks.

Authors:Svetlana Pavlitska, Beyza Keskin, Alwin Faßbender, Christian Hubschneider, J. Marius Zöllner
Title: Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation
Abstract:
Estimating accurate and well-calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety-critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantic split of the A2D2 dataset. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out-of-distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
中文: 专家混合模型(MoE)能比集成方法更高效地为语义分割提供校准良好的预测不确定性估计,其中预测熵和互信息等方法在分布外数据下展现出更高的可靠性。
English: Mixtures of Experts (MoE) provide well-calibrated predictive uncertainty estimates for semantic segmentation more efficiently than ensembles, with methods like predictive entropy and mutual information showing improved reliability under out-of-distribution data.
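The three uncertainty measures named in the abstract have standard forms, sketched below from per-expert softmax outputs: predictive entropy of the averaged prediction, mutual information as the gap between that entropy and the mean per-expert entropy, and expert variance as a disagreement score. Uniform expert weighting is an assumption here; a gated MoE would weight experts by the gate.

```python
# Predictive entropy, mutual information, and expert variance from per-expert softmaxes.
import numpy as np

def moe_uncertainties(expert_probs, eps=1e-12):
    """expert_probs: (num_experts, num_pixels, num_classes) softmax outputs."""
    mean_p = expert_probs.mean(axis=0)                                       # (P, C)
    predictive_entropy = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)       # H[E_k p_k]
    expected_entropy = (-(expert_probs * np.log(expert_probs + eps))
                        .sum(axis=-1)).mean(axis=0)                          # E_k H[p_k]
    mutual_information = predictive_entropy - expected_entropy
    expert_variance = expert_probs.var(axis=0).mean(axis=-1)                 # disagreement
    return predictive_entropy, mutual_information, expert_variance

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 3))                       # 2 experts, 5 pixels, 3 classes
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print([u.round(3) for u in moe_uncertainties(probs)])
```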

Authors:Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita
Title: Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
Abstract:
This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. In our method, such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection. For further improvement, the group structure should be consistent over time. While previous methods are stabilized under the assumption that groups do not change within a video, our method detects dynamically changing groups by global optimization using a graph with all frames' groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git
中文: 本文提出了一种动态视频人群检测方法,通过增强的视觉语言模型提取局部与全局特征,并利用全局优化实现时序一致性,在公开数据集上超越了现有最优方法。
English: This paper introduces a dynamic human group detection method in videos that combines local and global features using an enhanced Vision-Language Model and achieves temporal consistency through global optimization, outperforming existing techniques on public datasets.

Authors:Mustafa Munir, Alex Zhang, Radu Marculescu
Title: VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation
Abstract:
Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce \textit{VCMamba}, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. These convolutional blocks are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba's effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.
中文: VCMamba是一种新型视觉骨干网络,融合了CNN的局部特征提取能力和多向Mamba SSM的全局上下文建模优势,在ImageNet分类和ADE20K分割任务中实现了线性复杂度的卓越性能。
English: VCMamba is a novel vision backbone that combines CNNs' local feature extraction with multi-directional Mamba SSMs' global context modeling, achieving superior performance with linear complexity on ImageNet classification and ADE20K segmentation.

Authors:Jingyi Lu, Kai Han
Title: Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
Abstract:
Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512x512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance. Project page: https://visual-ai.github.io/inpaint4drag/
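A toy pixel-space sketch of the decomposition described above: the dragged region is warped directly in image space, and the uncovered pixels form the mask handed to any off-the-shelf inpainting model. The rigid translation used here is purely illustrative; Inpaint4Drag's bidirectional warping is more elaborate.

```python
# Toy illustration: warp the dragged region in pixel space, then expose the holes
# left behind as an inpainting mask. Purely illustrative, not Inpaint4Drag's warp.
import numpy as np

def drag_and_mask(image, region_mask, drag_vec):
    """Shift the masked region by drag_vec=(dy, dx); return warped image + inpaint mask."""
    warped = image.copy()
    hole = region_mask.copy()
    ys, xs = np.nonzero(region_mask)
    ty, tx = ys + drag_vec[0], xs + drag_vec[1]
    ok = (ty >= 0) & (ty < image.shape[0]) & (tx >= 0) & (tx < image.shape[1])
    warped[ty[ok], tx[ok]] = image[ys[ok], xs[ok]]   # forward-warp the dragged pixels
    hole[ty[ok], tx[ok]] = False                     # destination pixels are now covered
    return warped, hole                              # `hole` is what an inpainter fills

img = np.arange(64, dtype=float).reshape(8, 8)
mask = np.zeros((8, 8), bool); mask[2:4, 2:4] = True
warped, hole = drag_and_mask(img, mask, (3, 3))
print(hole.sum())                                    # uncovered source pixels left to fill
```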

Authors:Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
Title: Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview
Abstract:
We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation tasks as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video -- a 360-degree video that comprehensively captures the human's whole-body appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.

Authors:Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee
Title: TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
Abstract:
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

Authors:Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, Lu Qi
Title: One Flight Over the Gap: A Survey from Perspective to Panoramic Vision
Abstract:
Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360\textdegree{} field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs differ markedly from perspective images in geometric projection, spatial distribution, and boundary continuity, making direct domain adaptation from perspective methods challenging. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers along two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panorama-specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insights and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama
中文摘要:本综述探讨全景视觉技术,重点分析如何将平面图像方法适配到全方位图像,通过解决几何畸变和边界连续性等特有挑战,对任务进行分类并展望未来研究方向。
English Summary: This survey examines panoramic vision techniques, focusing on adapting perspective methods to omnidirectional images by addressing unique challenges like geometric distortions and boundary continuity, while categorizing tasks and discussing future research directions.
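Since the survey's three challenges all stem from the equirectangular projection, a small sketch of the ERP pixel-to-sphere mapping is included below; it makes the cos(latitude) area shrinkage near the poles and the periodic horizontal boundary explicit. Resolution values are arbitrary.

```python
# Sketch of the equirectangular projection (ERP): pixel grid -> unit-sphere directions.
import numpy as np

def erp_to_directions(height, width):
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi       # wraps around: periodic boundary
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi      # +pi/2 (top row) .. -pi/2 (bottom row)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)             # (H, W, 3) unit vectors

dirs = erp_to_directions(64, 128)
print(dirs.shape)
# Per-pixel solid angle scales with cos(latitude): rows near the poles cover far less
# of the sphere than equatorial rows, i.e. the non-uniform ERP sampling noted above.
```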

Authors:Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
Title: The Telephone Game: Evaluating Semantic Drift in Unified Models
Abstract:
Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, nor whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. SDP reveals substantial variation in cross-modal stability: some models, such as BAGEL, maintain semantic meaning over many alternations, whereas others, such as VILA-U, drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
Chinese Summary: 本文提出语义漂移协议(SDP),通过图文跨模态循环转换评估统一视觉语言模型的语义一致性,采用新度量指标和专用基准测试,揭示了不同模型在保持语义稳定性方面的显著差异。
English Summary: This paper introduces the Semantic Drift Protocol (SDP) to evaluate cross-modal consistency in unified visual language models by cycling between image-to-text and text-to-image tasks, revealing significant variations in semantic stability across models through novel metrics and a specialized benchmark.
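A schematic version of the cyclic protocol: alternate I2T and T2I for several generations and measure how far later captions drift from the first in an embedding space. The model calls and the drift formula below are placeholders (the abstract does not give the exact MCD definition); only the alternation structure follows the text.

```python
# Schematic cyclic I2T/T2I evaluation with an assumed embedding-drift score.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def semantic_drift(image, i2t, t2i, embed, generations=4):
    captions, images = [], [image]
    for _ in range(generations):
        captions.append(i2t(images[-1]))     # image-to-text pass of the unified model
        images.append(t2i(captions[-1]))     # text-to-image pass of the unified model
    ref = embed(captions[0])
    # Assumed MCD-style score: mean embedding drift of later captions from the first.
    return float(np.mean([cosine_distance(ref, embed(c)) for c in captions[1:]]))

# Hypothetical stand-ins: a "model" whose state degrades slightly each cycle.
rng = np.random.default_rng(0)
i2t = lambda img: img + rng.normal(0, 0.05, img.shape)
t2i = lambda cap: cap + rng.normal(0, 0.05, cap.shape)
embed = lambda x: x
print(semantic_drift(rng.normal(size=8), i2t, t2i, embed))
```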

Authors:Hyunsoo Cha, Byungjun Kim, Hanbyul Joo
Title: Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer
Abstract:
We present Durian, the first method for generating portrait animation videos with cross-identity attribute transfer from one or more reference images to a target portrait. Training such models typically requires attribute pairs of the same individual, which are rarely available at scale. To address this challenge, we propose a self-reconstruction formulation that leverages ordinary portrait videos to learn attribute transfer without explicit paired data. Two frames from the same video act as a pseudo pair: one serves as an attribute reference and the other as an identity reference. To enable this self-reconstruction training, we introduce a Dual ReferenceNet that processes the two references separately and then fuses their features via spatial attention within a diffusion model. To make sure each reference functions as a specialized stream for either identity or attribute information, we apply complementary masking to the reference images. Together, these two components guide the model to reconstruct the original video, naturally learning cross-identity attribute transfer. To bridge the gap between self-reconstruction training and cross-identity inference, we introduce a mask expansion strategy and augmentation schemes, enabling robust transfer of attributes with varying spatial extent and misalignment. Durian achieves state-of-the-art performance on portrait animation with attribute transfer. Moreover, its dual reference design uniquely supports multi-attribute composition and smooth attribute interpolation within a single generation pass, enabling highly flexible and controllable synthesis.

Authors:Zanwei Zhou, Taoran Yi, Jiemin Fang, Chen Yang, Lingxi Xie, Xinggang Wang, Wei Shen, Qi Tian
Title: Few-step Flow for 3D Generation via Marginal-Data Transport Distillation
Abstract:
Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective requires integrating the velocity fields, but this integral is intractable to implement. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity level and the distribution level, respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneering 3D generation framework TRELLIS, our method reduces the sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on an A800, while preserving high visual and geometric fidelity. Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods and enables TRELLIS to achieve superior performance in few-step 3D generation.
中文: 本研究提出MDT-dist框架,通过速度匹配和速度蒸馏实现少步3D流蒸馏,将采样步骤从25步减少至1-2步,在A800上实现最高9倍加速且保持高质量生成效果,显著优于现有一致性模型蒸馏方法。
English: This study introduces MDT-dist, a novel framework for few-step 3D flow distillation that employs Velocity Matching and Velocity Distillation to efficiently reduce sampling steps from 25 to 1 or 2 while maintaining high fidelity, achieving up to 9x speedup on A800 and outperforming existing methods.
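As a toy illustration of the Velocity Matching objective named above, the sketch below regresses a student velocity network toward a frozen teacher at interpolated states; the networks, interpolation convention, and time sampling are placeholders, and Velocity Distillation is not shown.

```python
# Toy velocity-matching (VM) objective: regress the student's velocity field toward
# the teacher's at sampled times. Not the MDT-dist implementation.
import torch
import torch.nn as nn

teacher = nn.Linear(9, 8)        # placeholder velocity nets: input = state (8) + time (1)
student = nn.Linear(9, 8)

def velocity_matching_loss(x0, noise):
    t = torch.rand(x0.size(0), 1)
    x_t = (1 - t) * x0 + t * noise                       # assumed rectified-flow interpolation
    inp = torch.cat([x_t, t], dim=-1)
    with torch.no_grad():
        v_teacher = teacher(inp)                         # frozen teacher velocity
    return ((student(inp) - v_teacher) ** 2).mean()

loss = velocity_matching_loss(torch.randn(4, 8), torch.randn(4, 8))
loss.backward()                                          # only the student receives gradients
print(float(loss))
```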

Authors:Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, Lei Bai
Title: Transition Models: Rethinking the Generative Learning Objective
Abstract:
A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval. This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to 4096x4096.
中文摘要:过渡模型(TiM)通过引入连续时间动态方程,解决了生成模型中计算效率与输出质量之间的固有矛盾,实现了任意步长的灵活转换,以更少参数达到顶尖性能,并能在增加采样步数时保持质量的单调提升。
English Summary: Transition Models (TiM) overcome the trade-off between computational efficiency and output quality in generative modeling by introducing a continuous-time dynamics equation that enables flexible step transitions, achieving state-of-the-art performance with fewer parameters while maintaining monotonic quality improvement with increased sampling steps.

Authors:Jimin Xu, Bosheng Qin, Tao Jin, Zhou Zhao, Zhenhui Ye, Jun Yu, Fei Wu
Title: SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer
Abstract:
Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page https://jm-xu.github.io/SSGaussian for immersive visualization.
Chinese: 针对现有三维风格迁移方法难以提取高级语义和保持结构清晰的问题,本研究提出创新流程,通过扩散先验的跨视角对齐和实例级迁移技术,实现了在多种场景中显著优化的风格化效果。
English: Recent advances in 3D style transfer struggle with extracting high-level semantics and maintaining structural clarity, prompting the development of a novel pipeline that leverages diffusion priors through cross-view alignment and instance-level transfer to achieve superior stylization.

Authors:JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
Title: From Editor to Dense Geometry Estimator
Abstract:
Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.
中文: 研究表明,相比文本到图像生成模型,微调图像编辑模型因其固有的结构先验能更有效地进行密集几何估计,由此开发的FE2E框架无需增加训练数据即可实现显著性能提升。
English: The study demonstrates that fine-tuning image editing models, rather than text-to-image generators, yields superior dense geometry estimation due to their inherent structural priors, leading to the development of the FE2E framework that achieves significant performance gains without additional training data.

Authors:Silvio Chito, Paolo Rabino, Tatiana Tommasi
Title: Efficient Odd-One-Out Anomaly Detection
Abstract:
The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/

Authors:Safouane El Ghazouali, Umberto Michelucci
Title: VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision
Abstract:
AI models rely on annotated data to learn patterns and perform predictions. Annotation is usually a labor-intensive step that requires associating labels ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop effort. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. When tested on COCO-type classes within this framework, the initial predictions proved to be mostly correct, and users can refine them via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm offers on-the-fly segmentation powered by Segment Anything, accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility. VisioFirm demonstrates up to a 90\% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected components with CLIP-based disambiguation and an IoU graph for redundant detection suppression. VisioFirm can be accessed at \href{https://github.com/OschAI/VisioFirm}{https://github.com/OschAI/VisioFirm}.
Chinese: VisioFirm 是一款开源网络应用程序,通过集成 CLIP 和 Grounding DINO 等基础模型实现AI辅助自动化图像标注,大幅减少人工操作,同时支持多种标注任务和导出格式。
English: VisioFirm is an open-source web application that leverages AI-assisted automation with foundation models like CLIP and Grounding DINO to streamline image labeling, significantly reducing manual effort while supporting various annotation tasks and export formats.
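One of the post-processing steps mentioned is an IoU graph for suppressing redundant detections; the sketch below implements a simple greedy IoU-based suppression in that spirit (effectively non-maximum suppression), which may differ from VisioFirm's exact logic.

```python
# Greedy IoU-based suppression of redundant boxes (a sketch in the spirit of the
# "IoU-graph" step; not VisioFirm's exact implementation).
def box_iou(a, b):
    """a, b: [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def suppress_redundant(boxes, scores, iou_thr=0.6):
    keep = []
    for i in sorted(range(len(boxes)), key=lambda i: -scores[i]):
        if all(box_iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(suppress_redundant(boxes, scores))   # [0, 2] -- the near-duplicate box is dropped
```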

Authors:Orlando Castaneda, Kevin So-Tang, Kshitij Gurung
Title: Revisiting Simple Baselines for In-The-Wild Deepfake Detection
Abstract:
The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released "in-the-wild" benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance -- 81% accuracy on Deepfake-Eval-2024 -- surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at https://github.com/Deepfake-Detection-KKO/deepfake-detection.
Chinese: 本研究证明,通过对基线深度伪造检测模型进行优化的超参数调整,在Deepfake-Eval-2024基准测试中达到了81%的准确率,可与商业检测器相媲美,同时揭示了实际部署中的权衡问题。
English: This study demonstrates that improved hyperparameter tuning of a baseline deepfake detection model achieves 81% accuracy on the Deepfake-Eval-2024 benchmark, rivaling commercial detectors while highlighting practical deployment tradeoffs.

Authors:Quang-Huy Che, Duc-Khai Lam
Title: TriLiteNet: Lightweight Model for Multi-Task Visual Perception
Abstract:
Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. Specifically, the TriLiteNet_{base} demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at https://github.com/chequanghuy/TriLiteNet.
中文摘要:本研究提出TriLiteNet模型,这是一种面向自动驾驶的高效多任务感知模型,在保持低计算成本和嵌入式设备低延迟的同时,能在车辆检测、可行驶区域分割和车道线分割三项关键任务中实现优越性能。
English Summary: This study introduces TriLiteNet, an efficient multi-task perception model for autonomous driving that achieves competitive performance in vehicle detection, drivable area segmentation, and lane line segmentation while maintaining low computational costs and latency on embedded devices.

Authors:Ashish Tiwari, Satyam Bhardwaj, Yash Bachwana, Parag Sarvoday Sahu, T. M. Feroz Ali, Bhargava Chintalapati, Shanmuganathan Raman
Title: TensoIS: A Step Towards Feed-Forward Tensorial Inverse Subsurface Scattering for Perlin Distributed Heterogeneous Media
Abstract:
Estimating scattering parameters of heterogeneous media from images is a severely under-constrained and challenging problem. Most of the existing approaches model BSSRDF either through an analysis-by-synthesis approach, approximating complex path integrals, or using differentiable volume rendering techniques to account for heterogeneity. However, only a few studies have applied learning-based methods to estimate subsurface scattering parameters, but they assume homogeneous media. Interestingly, no specific distribution is known to us that can explicitly model the heterogeneous scattering parameters in the real world. Notably, procedural noise models such as Perlin and Fractal Perlin noise have been effective in representing intricate heterogeneities of natural, organic, and inorganic surfaces. Leveraging this, we first create HeteroSynth, a synthetic dataset comprising photorealistic images of heterogeneous media whose scattering parameters are modeled using Fractal Perlin noise. Furthermore, we propose Tensorial Inverse Scattering (TensoIS), a learning-based feed-forward framework to estimate these Perlin-distributed heterogeneous scattering parameters from sparse multi-view image observations. Instead of directly predicting the 3D scattering parameter volume, TensoIS uses learnable low-rank tensor components to represent the scattering volume. We evaluate TensoIS on unseen heterogeneous variations over shapes from the HeteroSynth test set, smoke and cloud geometries obtained from open-source realistic volumetric simulations, and some real-world samples to establish its effectiveness for inverse scattering. Overall, this study is an attempt to explore Perlin noise distribution, given the lack of any such well-defined distribution in literature, to potentially model real-world heterogeneous scattering in a feed-forward manner.

Authors:Shiku Kaito, Shinnosuke Matsuo, Daiki Suehiro, Ryoma Bise
Title: Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning
Abstract:
The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at \href{https://github.com/Shiku-Kaito/Learning-from-Majority-Label-A-Novel-Problem-in-Multi-class-Multiple-Instance-Learning}{here}.
Chinese: 本文提出了一种名为“从多数标签学习”的新型多类多示例学习问题,其中包标签由实例的多数类决定,并通过带有多数比例增强模块的计数网络在四个数据集上超越了传统方法。
English: This paper introduces a novel multi-class multiple-instance learning problem called Learning from Majority Label (LML), where bag labels are determined by the majority class of instances, and proposes a Counting Network with a Majority Proportion Enhancement Module that outperforms conventional methods across four datasets.
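The Counting Network described above supervises instance-level predictions with only the bag's majority label by summing per-class assignments over the bag. A minimal sketch of that counting mechanism follows; the instance encoder, soft-count formulation, and dimensions are assumptions for illustration, and the Majority Proportion Enhancement Module is not shown.

```python
# Minimal sketch of a counting-style bag predictor for Learning from Majority
# Label: instance logits are softmaxed, soft counts are summed over the bag,
# and the normalized count vector is trained against the majority label.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CountingNet(nn.Module):
    def __init__(self, in_dim=128, num_classes=10):
        super().__init__()
        self.instance_clf = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, bag):                 # bag: (num_instances, in_dim)
        probs = F.softmax(self.instance_clf(bag), dim=-1)
        counts = probs.sum(dim=0)           # soft per-class counts in the bag
        bag_dist = counts / counts.sum()    # distribution over classes
        return probs, bag_dist

model = CountingNet()
bag = torch.randn(32, 128)                  # one bag of 32 instance features
majority_label = torch.tensor(3)            # bag-level majority class
_, bag_dist = model(bag)
loss = F.nll_loss(torch.log(bag_dist + 1e-8).unsqueeze(0),
                  majority_label.unsqueeze(0))
loss.backward()
```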

Authors:Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen
Title: Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
Abstract:
Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on the image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.
中文: 提出的MMChange方法通过融合图像与文本模态,借助专门模块提升遥感变化检测的精度和鲁棒性,在多项实验中均优于现有最优方法。
English: The proposed MMChange method enhances remote sensing change detection by integrating image and text modalities through specialized modules to improve accuracy and robustness, outperforming state-of-the-art approaches in experiments.

Authors:Minghui Zhang, Yaoyu Liu, Junyang Wu, Xin You, Hanxiao Zhang, Junjun He, Yun Gu
Title: TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes
Abstract:
Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose TopoSculpt, a novel framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) first introduces a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales. Extensive experiments on challenging pulmonary airway and Circle of Willis datasets demonstrate substantial improvements in both geometry and topology. For instance, $\beta_{0}$ errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with tree-length-detected and branch-detected rates improving by nearly 10\%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. The project homepage is available at: https://github.com/Puzzled-Hui/TopoSculpt.
中文: TopoSculpt框架通过整体区域建模、拓扑完整性约束和渐进式优化策略,显著提升了三维管状解剖结构的几何精度与拓扑正确性,在气道和脑动脉数据上取得突破性改进。
English: The proposed TopoSculpt framework introduces a holistic topological refinement approach for 3D tubular structures, employing Betti number constraints and persistent homology to significantly improve geometric accuracy and topological integrity across medical datasets.

Authors:Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva
Title: SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Abstract:
As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
中文:SPECS是一种针对长图像描述的新型高效评估指标,在保持与人类判断高度相关的同时,显著提升了计算效率,可作为模型开发中的实用工具。
English: SPECS is a new efficient metric for evaluating long image captions that matches the performance of LLM-based metrics in human judgment correlation while being significantly faster.
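SPECS builds on reference-free CLIP-style scoring: the image and the candidate caption are embedded and compared by cosine similarity, with SPECS adding a specificity-oriented training objective on top. The sketch below shows only the plain scoring step with the public openai/clip-vit-base-patch32 checkpoint as a stand-in for the specificity-finetuned model.

```python
# Minimal sketch of reference-free CLIP-based caption scoring (the machinery
# SPECS builds on). The checkpoint below is the public CLIP baseline, not the
# specificity-finetuned SPECS model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()   # cosine similarity in [-1, 1]
```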

Authors:Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Title: MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation
Abstract:
Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3\% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv
中文:MobileRAG是一种基于检索增强生成技术的新型移动代理框架,通过提升查询准确性、实现环境交互和集成记忆功能,能够以更少操作步骤高效处理复杂移动任务,有效解决了现有系统依赖性强、交互不足和缺乏记忆的问题。
English: MobileRAG is a novel mobile agent framework enhanced by Retrieval-Augmented Generation that addresses current limitations by improving query accuracy, enabling environmental interaction, and incorporating memory capabilities to handle complex mobile tasks more efficiently with fewer errors.

Authors:Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang, Fei Ma, Zhiyong Wu, Changpeng Yang, Zonghong Dai, Fei Richard Yu
Title: Human Motion Video Generation: A Survey
Abstract:
Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in Our Repository https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.
中文: 本综述首次全面概述人体运动视频生成领域,涵盖十余项子任务并详述生成过程的五个阶段,特别探讨了大语言模型的潜力,系统回顾了视觉、文本和音频三大模态的技术进展。
English: This survey provides the first comprehensive overview of human motion video generation, covering over ten sub-tasks and detailing the five-phase generation process while highlighting the potential of large language models and reviewing developments across vision, text, and audio modalities.

Authors:Jiajun Song, Xiaoou Liu
Title: SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition
Abstract:
Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot learning (CZSL). However, CZSFR faces three challenges: (1) Redundant background information distracts models from learning meaningful food features, (2) Role confusion between staple and side dishes leads to misclassification, and (3) Semantic bias in a single attribute can confuse the model's understanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; DebiasAT, which reduces the semantic bias by aligning prompts with visual features. Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and the most popular general datasets for the general CZSL. The code is available at https://github.com/Jiajun-RUC/SalientFusion.
中文: 本研究提出组合零样本食物识别任务以解决背景干扰和语义偏差等挑战,通过SalientFusion方法在新基准和通用数据集上取得了最优性能。
English: The study introduces Compositional Zero-Shot Food Recognition (CZSFR) to address challenges like background distraction and semantic bias, proposing the SalientFusion method which achieves state-of-the-art results on new benchmarks and general datasets.

Authors:Nan Yang, Yang Wang, Zhanwen Liu, Yuchao Dai, Yang Liu, Xiangmo Zhao
Title: Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection
Abstract:
Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at https://github.com/Zizzzzzzz/FocusMamba.
中文: FocusMamba提出了一种自适应协同稀疏化方法,通过事件引导策略智能剔除RGB-事件数据中的低信息区域并融合互补特征,在基准数据集上实现了精度与效率的双重提升。
English: FocusMamba introduces an adaptive collaborative sparsification method that efficiently discards low-information regions in RGB-Event data and integrates complementary features, achieving superior accuracy and efficiency on benchmark datasets.
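The key idea in the sparsification stage above is to let event activity decide, per sample, which image tokens carry information and which can be dropped before fusion. A minimal sketch of such event-guided token selection follows; the patch size, mean-based adaptive threshold, and keep-ratio floor are illustrative assumptions rather than the paper's EGMS design.

```python
# Minimal sketch of event-guided token sparsification: patch-level event
# counts decide which image tokens are kept. The adaptive threshold (a
# fraction of the mean activity) is an illustrative stand-in for EGMS.
import torch

def sparsify_tokens(tokens, event_voxels, patch=16, keep_ratio_floor=0.25):
    """tokens: (N, C) image tokens on an (H/patch, W/patch) grid.
    event_voxels: (H, W) accumulated event counts for the same frame."""
    H, W = event_voxels.shape
    grid = event_voxels.reshape(H // patch, patch, W // patch, patch)
    activity = grid.sum(dim=(1, 3)).flatten()        # events per patch, (N,)
    threshold = activity.mean()                       # adaptive, per sample
    keep = activity > threshold
    # never drop below a floor fraction of tokens for very quiet scenes
    min_keep = int(keep_ratio_floor * keep.numel())
    if keep.sum() < min_keep:
        keep_idx = activity.topk(min_keep).indices
    else:
        keep_idx = keep.nonzero(as_tuple=True)[0]
    return tokens[keep_idx], keep_idx

tokens = torch.randn(196, 256)                        # 14x14 token grid
events = torch.randint(0, 5, (224, 224)).float()      # accumulated event map
kept, idx = sparsify_tokens(tokens, events)
```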

Authors:Zongsen Qiu
Title: STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification
Abstract:
Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches -- one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
中文: 针对现有注意力机制难以捕捉植物病害细微特征的问题,本研究提出STA-Net轻量级模型,通过无训练神经网络架构搜索主干和新型形状-纹理双分支注意力模块,在植物病害数据集上实现89.00%的准确率,为边缘设备上的精准农业应用提供了有效解决方案。
English: To address the limitations of generic attention mechanisms in capturing subtle plant disease features, this study introduces STA-Net, a lightweight model combining a training-free neural architecture search backbone with a novel Shape-Texture Attention Module that achieves 89.00% accuracy on plant disease diagnosis while being optimized for edge deployment.
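The texture-aware branch of STAM relies on a Gabor filter bank to respond to oriented, periodic lesion textures. The sketch below constructs a small fixed bank and uses its response to gate a feature map; the number of orientations, kernel size, frequency, and the sigmoid gating are illustrative choices, not the exact STA-Net configuration.

```python
# Minimal sketch of a fixed Gabor filter bank used as a texture-aware branch.
# Orientation count, kernel size, and frequency are illustrative choices.
import math
import torch
import torch.nn.functional as F

def gabor_kernel(ksize=15, sigma=3.0, theta=0.0, lambd=8.0, gamma=0.5, psi=0.0):
    half = ksize // 2
    ys, xs = torch.meshgrid(torch.arange(-half, half + 1).float(),
                            torch.arange(-half, half + 1).float(),
                            indexing="ij")
    x_t = xs * math.cos(theta) + ys * math.sin(theta)
    y_t = -xs * math.sin(theta) + ys * math.cos(theta)
    envelope = torch.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = torch.cos(2 * math.pi * x_t / lambd + psi)
    return envelope * carrier

# bank of 4 orientations applied to a single-channel feature map
thetas = [i * math.pi / 4 for i in range(4)]
bank = torch.stack([gabor_kernel(theta=t) for t in thetas]).unsqueeze(1)  # (4,1,15,15)

feat = torch.randn(2, 1, 64, 64)                  # (B, 1, H, W) feature map
texture_resp = F.conv2d(feat, bank, padding=7)    # (B, 4, H, W), frozen weights
attention = torch.sigmoid(texture_resp.mean(dim=1, keepdim=True))
out = feat * attention                            # texture-gated features
```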

Authors:Taha Koleilat, Hassan Rivaz, Yiming Xiao
Title: Singular Value Few-shot Adaptation of Vision-Language Models
Abstract:
Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
Chinese: CLIP-SVD 提出了一种基于奇异值分解的参数高效适配方法,通过微调CLIP内部参数,在多个数据集上以极少的参数量实现了优异的准确性和泛化能力。
English: CLIP-SVD introduces a parameter-efficient adaptation method using Singular Value Decomposition to fine-tune CLIP's internal parameters, achieving superior accuracy and generalization with minimal parameter usage across multiple datasets.
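The core mechanism above is to factor each pretrained weight matrix as W = U S V^T, freeze U and V, and train only the singular values S, which rescales the pretrained basis vectors. A minimal sketch on a single linear layer is shown below; which layers receive this treatment and the training recipe are outside the sketch.

```python
# Minimal sketch of singular-value-only fine-tuning: a pretrained weight
# matrix is factored once with SVD, U and V stay frozen, and only the
# singular values are updated. Applying it to one linear layer is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDTunedLinear(nn.Module):
    def __init__(self, pretrained_linear: nn.Linear):
        super().__init__()
        W = pretrained_linear.weight.data                 # (out, in)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U)                      # frozen bases
        self.register_buffer("Vh", Vh)
        self.S = nn.Parameter(S)                          # the only trainable part
        if pretrained_linear.bias is not None:
            self.register_buffer("bias", pretrained_linear.bias.data.clone())
        else:
            self.bias = None

    def forward(self, x):
        W = self.U @ torch.diag(self.S) @ self.Vh         # rescaled reconstruction
        return F.linear(x, W, self.bias)

layer = nn.Linear(512, 512)                               # stand-in pretrained layer
tuned = SVDTunedLinear(layer)
trainable = sum(p.numel() for p in tuned.parameters() if p.requires_grad)
print(trainable)     # 512 singular values vs. 262,656 original parameters
```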

Authors:Casper van Engelenburg, Jan van Gemert, Seyran Khademi
Title: LayoutGKN: Graph Similarity Learning of Floor Plans
Abstract:
Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs, i.e., graph matching networks, rely on costly intermediate cross-graph node-level interactions and are therefore slow at inference time. We introduce \textbf{LayoutGKN}, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. \href{https://github.com/caspervanengelenburg/LayoutGKN}{Code and data} are open.
中文:LayoutGKN是一种更高效的图匹配方法,它将跨图节点交互推迟到最终嵌入阶段,在显著提升推理速度的同时,实现了与现有网络相当或更优的相似性计算效果。
English: LayoutGKN is a more efficient graph matching method that delays cross-graph interactions until the final embedding stage, achieving comparable or better similarity accuracy while significantly speeding up inference compared to existing networks.
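Because the cross-graph interaction happens only at the end, the similarity function reduces to a differentiable kernel over the two sets of learned node embeddings. The sketch below uses a simple normalized RBF set kernel as a stand-in for LayoutGKN's graph kernel; the kernel choice, bandwidth, and normalization are assumptions for illustration.

```python
# Minimal sketch of a differentiable kernel on final node embeddings, used as
# the similarity between two floor-plan graphs. The RBF set kernel below is a
# simplified stand-in for LayoutGKN's graph kernel.
import torch

def rbf_set_kernel(h1, h2, gamma=0.5):
    """h1: (n1, d), h2: (n2, d) node embeddings of two graphs."""
    d2 = torch.cdist(h1, h2, p=2) ** 2          # pairwise squared distances
    return torch.exp(-gamma * d2).mean()        # mean pairwise similarity

def graph_similarity(h1, h2, gamma=0.5):
    # normalized so identical graphs score 1 (standard kernel normalization)
    k12 = rbf_set_kernel(h1, h2, gamma)
    k11 = rbf_set_kernel(h1, h1, gamma)
    k22 = rbf_set_kernel(h2, h2, gamma)
    return k12 / torch.sqrt(k11 * k22)

h_a = torch.randn(12, 64, requires_grad=True)   # node embeddings from graph A
h_b = torch.randn(15, 64, requires_grad=True)   # node embeddings from graph B
sim = graph_similarity(h_a, h_b)
sim.backward()                                  # gradients flow to both encoders
```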

Authors:Seth Z. Zhao, Huizhi Zhang, Zhaowei Li, Juntong Peng, Anthony Chui, Zewei Zhou, Zonglin Meng, Hao Xiang, Zhiyu Huang, Fujia Wang, Ran Tian, Chenfeng Xu, Bolei Zhou, Jiaqi Ma
Title: QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception
Abstract:
Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Noticeably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce \textbf{QuantV2X}, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2$\times$ and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment. The system will be publicly released to promote research in this field: https://github.com/ucla-mobility/QuantV2X.
中文: QuantV2X提出了首个完全量化的协同感知系统,通过统一量化策略在保持精度的同时大幅降低计算和传输开销,实现了3.2倍的延迟降低和更好的可扩展性,为V2X实际部署提供了可行方案。
English: QuantV2X introduces a fully quantized cooperative perception system that reduces computational and transmission costs while maintaining accuracy comparable to full-precision systems, achieving significant latency reduction and improved scalability for real-world V2X deployment.
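One ingredient of the pipeline above is quantizing the intermediate feature messages exchanged between agents to cut transmission bandwidth. The sketch below shows plain symmetric per-tensor int8 quantization and dequantization of a BEV-style feature map; QuantV2X's actual scheme (low-bit weights, calibrated or per-channel scales) is richer, so treat this as an illustration of the bandwidth saving only.

```python
# Minimal sketch of quantizing an intermediate feature message to int8 before
# transmission and dequantizing on the receiver. Symmetric per-tensor scaling
# is an illustrative simplification.
import numpy as np

def quantize_int8(feat: np.ndarray):
    scale = np.abs(feat).max() / 127.0 + 1e-12
    q = np.clip(np.round(feat / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

bev_feature = np.random.randn(64, 100, 100).astype(np.float32)  # sender side
q_msg, scale = quantize_int8(bev_feature)         # 4x smaller than float32
recovered = dequantize(q_msg, scale)              # receiver side
print(q_msg.nbytes, bev_feature.nbytes)           # 640000 vs 2560000 bytes
print(np.abs(recovered - bev_feature).max())      # bounded quantization error
```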

Authors:Rajeev Ranjan Dwivedi, Ankur Kumar, Vinod K Kurmi
Title: Multi Attribute Bias Mitigation via Representation Learning
Abstract:
Real-world images frequently exhibit multiple overlapping biases, including textures, watermarks, gendered makeup, scene object pairings, etc. These biases collectively impair the performance of modern vision models, undermining both their robustness and fairness. Addressing these biases individually proves inadequate, as mitigating one bias often permits or intensifies others. We tackle this multi-bias problem with Generalized Multi Bias Mitigation (GMBM), a lean two-stage framework that needs group labels only during training and minimizes bias at test time. First, Adaptive Bias Integrated Learning (ABIL) deliberately identifies the influence of known shortcuts by training encoders for each attribute and integrating them with the main backbone, compelling the classifier to explicitly recognize these biases. Then, Gradient Suppression Fine Tuning prunes those very bias directions from the backbone's gradients, leaving a single compact network that ignores all the shortcuts it just learned to recognize. Moreover, we find that existing bias metrics break under subgroup imbalance and train-test distribution shifts, so we introduce Scaled Bias Amplification (SBA): a test-time measure that disentangles model-induced bias amplification from distributional differences. We validate GMBM on FB CMNIST, CelebA, and COCO, where we boost worst-group accuracy, halve multi-attribute bias amplification, and set a new low in SBA even as bias complexity and distribution shifts intensify, making GMBM the first practical, end-to-end multi-bias solution for visual recognition. Project page: http://visdomlab.github.io/GMBM/
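The gradient-suppression step above removes the components of backbone gradients that align with previously identified bias directions. A minimal sketch of that projection is shown below; how the bias directions are estimated (ABIL in the paper) is not shown, and the assumption that they form an orthonormal set is mine for illustration.

```python
# Minimal sketch of gradient suppression: after backward(), each parameter
# gradient is projected orthogonal to known bias directions. Estimating the
# bias directions themselves (ABIL) is outside this sketch.
import torch

def suppress_bias_directions(param: torch.nn.Parameter, bias_dirs: torch.Tensor):
    """bias_dirs: (k, numel) orthonormal vectors in the flattened parameter space."""
    if param.grad is None:
        return
    g = param.grad.reshape(-1)
    coeffs = bias_dirs @ g                       # projection coefficients, (k,)
    g_clean = g - bias_dirs.t() @ coeffs         # remove bias-aligned components
    param.grad.copy_(g_clean.reshape(param.grad.shape))

# usage inside a training step (model, loss, bias_directions assumed given):
# loss.backward()
# for name, p in model.named_parameters():
#     if name in bias_directions:
#         suppress_bias_directions(p, bias_directions[name])
# optimizer.step()
```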

Authors:Reina Ishikawa, Ryo Fujii, Hideo Saito, Ryo Hachiuma
Title: Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
Abstract:
Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range -- from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.
中文: 本文提出了D-GPTScore方法,通过多模态大语言模型将评估标准分解为更细粒度,并发布了包含单概念与多概念任务的CC-AlignBench基准,在与人偏好的对齐度上显著优于现有方法。
English: This paper introduces D-GPTScore, a human-aligned evaluation method that decomposes assessment criteria into finer aspects using MLLM, and releases CC-AlignBench, a benchmark for single- and multi-concept tasks, achieving superior correlation with human preferences.

Authors:Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu
Title: Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing
Abstract:
Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.
中文: 该摘要提出T-CAGU新型高光谱解混框架,通过结合Transformer捕获全局依赖与自适应图网络增强局部关联,在性能上超越了现有先进方法。
English: This abstract introduces T-CAGU, a novel hyperspectral unmixing framework that combines transformers for global dependencies and adaptive graph networks for local consistency, achieving superior performance over existing methods.

Authors:Yixiong Jing, Cheng Zhang, Haibing Wu, Guangming Wang, Olaf Wysocki, Brian Sheil
Title: InfraDiffusion: zero-shot depth map restoration with diffusion models and prompted segmentation from sparse infrastructure point clouds
Abstract:
Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick-level segmentation (identifying defects such as spalling and mortar loss) has been primarily conducted from RGB images. However, acquiring high-resolution images is impractical in low-light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine-grained segmentation. We present InfraDiffusion, a zero-shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM). Without task-specific training, InfraDiffusion enhances visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data are available at https://github.com/Jingyixiong/InfraDiffusion-official-implement.
Chinese Summary: InfraDiffusion是一种零样本框架,通过虚拟相机和扩散模型将砖石点云转换为增强的深度图,无需任务特定训练即可显著提升砖块级分割效果,实现基础设施自动化检测。
English Summary: InfraDiffusion is a zero-shot framework that converts masonry point clouds into enhanced depth maps using virtual cameras and diffusion models, significantly improving brick-level segmentation for automated infrastructure inspection without task-specific training.
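The first stage of the pipeline above renders the point cloud into a sparse depth map from a virtual pinhole camera, which the diffusion model then restores. A minimal sketch of that projection with a z-buffer follows; the intrinsics, resolution, and the assumption that points are already in the camera frame are illustrative.

```python
# Minimal sketch of rendering a point cloud into a sparse depth map through a
# virtual pinhole camera with z-buffering, the stage preceding diffusion-based
# restoration. Intrinsics and resolution are illustrative.
import numpy as np

def project_to_depth(points_cam: np.ndarray, K: np.ndarray, h: int, w: int):
    """points_cam: (N, 3) points already in the camera frame (z forward)."""
    valid = points_cam[:, 2] > 1e-6
    pts = points_cam[valid]
    uvw = (K @ pts.T).T                          # (N, 3) homogeneous pixels
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf, dtype=np.float32)
    for ui, vi, zi in zip(u[inside], v[inside], pts[inside][:, 2]):
        if zi < depth[vi, ui]:                   # keep the nearest point (z-buffer)
            depth[vi, ui] = zi
    depth[np.isinf(depth)] = 0.0                 # 0 marks pixels with no return
    return depth

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])   # virtual camera
cloud = np.random.uniform([-2, -2, 1], [2, 2, 6], size=(50_000, 3))
depth_map = project_to_depth(cloud, K, h=480, w=640)
```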

Authors:Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min
Title: RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion
Abstract:
Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation deterministically condenses each subject's activation, connectivity, age, and sex into reproducible text tokens; (ii) Hybrid frequency-spatial encoder fuses a hierarchical wavelet-mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) Adaptive semantic alignment module embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.
中文: RTGMFF框架通过从fMRI数据生成ROI级文本描述,并利用混合频空编码器和自适应语义对齐整合多模态特征,在ADHD-200和ABIDE基准测试中显著提升了脑部疾病诊断的准确性。
English: The RTGMFF framework enhances brain-disorder diagnosis by generating ROI-level text descriptions from fMRI data and integrating multimodal features through a hybrid frequency-spatial encoder and adaptive semantic alignment, demonstrating superior accuracy on ADHD-200 and ABIDE benchmarks.

Authors:Tzuhsuan Huang, Cheng Yu Yeo, Tsai-Ling Huang, Hong-Han Shuai, Wen-Huang Cheng, Jun-Cheng Chen
Title: Enhancing Robustness in Post-Processing Watermarking: An Ensemble Attack Network Using CNNs and Transformers
Abstract:
Recent studies on deep watermarking have predominantly focused on in-processing watermarking, which integrates the watermarking process into image generation. However, post-processing watermarking, which embeds watermarks after image generation, offers more flexibility. It can be applied to outputs from any generative model (e.g., GANs, diffusion models) without needing access to the model's internal structure. It also allows users to embed unique watermarks into individual images. Therefore, this study focuses on post-processing watermarking and enhances its robustness by incorporating an ensemble attack network during training. We construct various versions of attack networks using CNN and Transformer in both spatial and frequency domains to investigate how each combination influences the robustness of the watermarking model. Our results demonstrate that combining a CNN-based attack network in the spatial domain with a Transformer-based attack network in the frequency domain yields the highest robustness in watermarking models. Extensive evaluation on the WAVES benchmark, using average bit accuracy as the metric, demonstrates that our ensemble attack network significantly enhances the robustness of baseline watermarking methods under various stress tests. In particular, for the Regeneration Attack defined in WAVES, our method improves StegaStamp by 18.743%. The code is released at: https://github.com/aiiu-lab/DeepRobustWatermark.
中文摘要:本研究通过训练中引入集成攻击网络,结合空间域的CNN和频域的Transformer模型,显著增强了后处理水印的鲁棒性,尤其在WAVES基准测试中将StegaStamp对再生攻击的抵抗能力提升了18.743%。
English Summary: This study advances post-processing watermarking by integrating an ensemble attack network during training, combining CNN and Transformer models across spatial and frequency domains to significantly boost watermark robustness, particularly improving StegaStamp by 18.743% against regeneration attacks.

Authors:Shuai Jiang, Yunfeng Ma, Jingyu Zhou, Yuan Bian, Yaonan Wang, Min Liu
Title: Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability
Abstract:
Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on modality-missing problems caused by uncertain sensor availability in MISDD. In this context, the fusion of multiple modalities encounters several difficulties, including learning mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) the cross-modal consistency prompt, which serves to establish information consistency across the dual visual modalities; ii) the modality-specific prompt, which is inserted to adapt to different input patterns; iii) the missing-aware prompt, which is attached to compensate for the information vacancy caused by dynamically missing modalities. In addition, we propose symmetric contrastive learning, which utilizes the text modality as a bridge for the fusion of the dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experiment results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate of 0.7 for RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58% respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
中文: 本文提出了一种新颖的多模态工业表面缺陷检测方法,通过跨模态提示学习和对称对比学习解决模态缺失问题,在不同缺失类型和比率下均优于现有方法。
English: This article introduces a novel method for multimodal industrial surface defect detection that addresses modality-missing issues through cross-modal prompt learning and symmetric contrastive learning, achieving superior performance over existing approaches.

Authors:Zeyu Liu, Shengwei Ding
Title: STAR: A Fast and Robust Rigid Registration Framework for Serial Histopathological Images
Abstract:
Registration of serial whole-slide histopathological images (WSIs) is critical for enabling direct comparison across diverse stains and for preparing paired datasets in artificial intelligence (AI) workflows such as virtual staining and biomarker prediction. While existing methods often rely on complex deformable or deep learning approaches that are computationally intensive and difficult to reproduce, lightweight rigid frameworks-sufficient for many consecutive-section scenarios-remain underdeveloped. We introduce STAR (Serial Tissue Alignment for Rigid registration), a fast and robust open-source framework for multi-WSI alignment. STAR integrates stain-conditioned preprocessing with a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control, achieving reliable rigid registration across heterogeneous tissue types and staining protocols, including hematoxylin-eosin (H&E), special histochemical stains (e.g., PAS, PASM, Masson's), and immunohistochemical (IHC) markers (e.g., CD31, KI67). Evaluated on the ANHIR 2019 and ACROBAT 2022 datasets spanning multiple organs and scanning conditions, STAR consistently produced stable alignments within minutes per slide, demonstrating robustness to cross-stain variability and partial tissue overlap. Beyond benchmarks, we present case studies on H&E-IHC alignment, construction of multi-IHC panels, and typical failure modes, underscoring both utility and limitations. Released as an open and lightweight tool, STAR provides a reproducible baseline that lowers the barrier for clinical adoption and enables large-scale paired data preparation for next-generation computational pathology.
中文:STAR是一种快速、鲁棒的开源框架,用于连续全切片病理图像的刚性配准,它结合了特定预处理与分层相关策略,能在不同染色方案和组织类型间实现可靠对齐。
English: STAR is a fast, robust open-source framework for rigid registration of serial whole-slide histopathological images, integrating specialized preprocessing with hierarchical correlation to achieve reliable alignment across diverse stains and tissue types.
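The rigid search at the heart of the framework above is a correlation over translations (and rotations), run coarse-to-fine. The sketch below shows FFT phase correlation recovering the translational offset between two downsampled slide thumbnails, which the hierarchy would refine; rotation search, stain-conditioned preprocessing, and quality control are omitted and the image sizes are illustrative.

```python
# Minimal sketch of FFT phase correlation for estimating the translational
# offset between two downsampled thumbnails, the kind of correlation step a
# coarse-to-fine rigid search builds on.
import numpy as np

def phase_correlation(ref: np.ndarray, mov: np.ndarray):
    """Returns (dy, dx) such that np.roll(ref, (dy, dx), axis=(0, 1)) ~ mov."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(mov)
    cross_power = F_mov * np.conj(F_ref)
    cross_power /= np.abs(cross_power) + 1e-12       # keep phase only
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    if dy > ref.shape[0] // 2:                       # wrap negative shifts
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

ref = np.zeros((256, 256)); ref[100:140, 80:120] = 1.0
mov = np.roll(np.roll(ref, 7, axis=0), -12, axis=1)  # known shift (7, -12)
print(phase_correlation(ref, mov))                   # approximately (7, -12)
```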

Authors:Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
Title: ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
Abstract:
Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.
中文: 本文提出了ProMQA-Assembly多模态问答数据集,包含391个问答对,采用半自动标注方法评估实际装配任务中的AI助手,基准测试表明现有模型仍有很大改进空间。
English: This paper introduces ProMQA-Assembly, a multimodal QA dataset with 391 question-answer pairs designed to evaluate AI assistants in practical assembly tasks, using a semi-automated annotation method and benchmarking that reveals significant room for improvement in current models.

Authors:Armin Saadat, Nima Hashemi, Hooman Vaseli, Michael Y. Tsang, Christina Luong, Michiel Van de Panne, Teresa S. M. Tsang, Purang Abolmaesumi
Title: PRECISE-AS: Personalized Reinforcement Learning for Efficient Point-of-Care Echocardiography in Aortic Stenosis Diagnosis
Abstract:
Aortic stenosis (AS) is a life-threatening condition caused by a narrowing of the aortic valve, leading to impaired blood flow. Despite its high prevalence, access to echocardiography (echo), the gold-standard diagnostic tool, is often limited due to resource constraints, particularly in rural and underserved areas. Point-of-care ultrasound (POCUS) offers a more accessible alternative but is restricted by operator expertise and the challenge of selecting the most relevant imaging views. To address this, we propose a reinforcement learning (RL)-driven active video acquisition framework that dynamically selects each patient's most informative echo videos. Unlike traditional methods that rely on a fixed set of videos, our approach continuously evaluates whether additional imaging is needed, optimizing both accuracy and efficiency. Tested on data from 2,572 patients, our method achieves 80.6% classification accuracy while using only 47% of the echo videos compared to a full acquisition. These results demonstrate the potential of active feature acquisition to enhance AS diagnosis, making echocardiographic assessments more efficient, scalable, and personalized. Our source code is available at: https://github.com/Armin-Saadat/PRECISE-AS.
中文: 该研究提出强化学习驱动的动态采集框架,通过智能选择最具诊断价值的心脏超声视频,仅用47%的视频量即实现80.6%的主动脉狭窄分类准确率,显著提升了诊断效率与可及性。
English: The proposed reinforcement learning framework dynamically selects the most informative echocardiography videos for aortic stenosis diagnosis, achieving 80.6% accuracy while using only 47% of videos compared to full acquisition, thereby enhancing diagnostic efficiency and accessibility.

Authors:Mennatullah Siam
Title: PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
Abstract:
Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, particularly designed for the visual grounding task, to study video MLLMs' ability to identify true motion from a fake one and their ability to grasp the motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated towards leveraging the interaction between motion and language rather than being dominated by static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos. Code and datasets will be made publicly available at https://github.com/MSiam/PixFoundation-2.0.git.
中文: 本研究提出了以运动为中心的基准MoCentric-Bench,旨在评估视频多模态大语言模型利用运动线索进行像素级视觉定位的能力,通过四项运动探测技术验证模型对真实运动与运动顺序的理解,弥补现有数据集中静态特征主导的不足。
English: This study introduces MoCentric-Bench, a motion-centric benchmark designed to evaluate video multi-modal large language models' ability to perform pixel-level visual grounding by leveraging motion cues and temporal reasoning, addressing limitations in existing datasets that overemphasize static appearance.

Authors:You Shen, Zhipeng Zhang, Yansong Qu, Liujuan Cao
Title: FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
Abstract:
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: https://mystorm16.github.io/fastvggt/.
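Token merging reduces attention cost by averaging highly similar tokens before each block. The sketch below shows a generic greedy, ToMe-style bipartite merge on cosine similarity; FastVGGT's 3D-specific token partitioning strategy differs, so this is only an illustration of the underlying merging idea.

```python
# Minimal sketch of similarity-based token merging (ToMe-style): tokens are
# split into two sets, the most redundant cross-set pairs are averaged
# greedily, and the token count shrinks before attention.
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int):
    """x: (N, C) tokens; r: number of tokens to remove by merging."""
    a, b = x[0::2], x[1::2]                                    # bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()  # (Na, Nb)
    best_sim, best_b = sim.max(dim=-1)       # best partner in b for each a-token
    merge_a = best_sim.topk(r).indices       # the r most redundant a-tokens
    keep_a = torch.ones(a.shape[0], dtype=torch.bool)
    keep_a[merge_a] = False
    b = b.clone()
    b[best_b[merge_a]] = 0.5 * (b[best_b[merge_a]] + a[merge_a])  # average pairs
    return torch.cat([a[keep_a], b], dim=0)                       # N - r tokens

tokens = torch.randn(1024, 256)
reduced = merge_tokens(tokens, r=256)        # 768 tokens remain after merging
print(reduced.shape)
```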

Authors:Xinrui Gong, Oliver Hahn, Christoph Reich, Krishnakant Singh, Simone Schaub-Meyer, Daniel Cremers, Stefan Roth
Title: Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery
Abstract:
Unsupervised multi-object discovery (MOD) aims to detect and localize distinct object instances in visual scenes without any form of human supervision. Recent approaches leverage object-centric learning (OCL) and motion cues from video to identify individual objects. However, these approaches use supervision to generate pseudo labels to train the OCL model. We address this limitation with MR-DINOSAUR -- Motion-Refined DINOSAUR -- a minimalistic unsupervised approach that extends the self-supervised pre-trained OCL model, DINOSAUR, to the task of unsupervised multi-object discovery. We generate high-quality unsupervised pseudo labels by retrieving video frames without camera motion for which we perform motion segmentation of unsupervised optical flow. We refine DINOSAUR's slot representations using these pseudo labels and train a slot deactivation module to assign slots to foreground and background. Despite its conceptual simplicity, MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets, outperforming the previous state of the art despite being fully unsupervised.
Chinese: MR-DINOSAUR是一种完全无监督的方法,通过运动生成的伪标签优化自监督的以物体为中心学习模型,在无需人工标注的情况下实现了最先进的多物体发现性能。
English: MR-DINOSAUR is a fully unsupervised approach that refines self-supervised object-centric learning models using motion-based pseudo labels to achieve state-of-the-art multi-object discovery without human supervision.

Authors:Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, Kai Yuan
Title: Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common Framework
Abstract:
Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations. In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly. We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation. The code for our framework is available at https://github.com/isl-org/unifi3d.
中文: 本研究提出一个统一的评估框架,用于比较各种三维表示在重建和生成中的表现,强调联合评估对优化质量、效率和特定应用需求的性能至关重要。
English: This study introduces a unified evaluation framework to compare diverse 3D representations in reconstruction and generation, emphasizing that joint assessment is crucial for optimizing performance across quality, efficiency, and application-specific needs.

Authors:Jingru Fan, Yufan Dang, Jingyao Wu, Huatao Li, Runde Yang, Xiyuan Yang, Yuheng Wang, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Dahai Li, Chen Qian
Title: AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
Abstract:
With the rapid evolution of large language models and multimodal foundation models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, modalities, apps, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant that operates across applications and constitutes a full-stack, closed-loop system from data to deployment. AppCopilot operationalizes this position through an end-to-end autonomous pipeline spanning data collection, training, deployment, high-quality and efficient inference, and mobile application development. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables user personalization and experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization for latency, memory, and energy across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along all four dimensions: stronger generalization, higher-precision on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime.
中文摘要:本文提出AppCopilot,一种设备端多模态助手,通过融合基础模型、多智能体协作和移动端优化部署,系统性地解决了移动智能体在泛化能力、操作精度、长程任务和运行效率四大核心难题。
English Summary: This paper introduces AppCopilot, an on-device multimodal assistant designed to address four core challenges in mobile agents—generalization, accuracy, long-horizon capability, and efficiency—through an integrated system combining foundation models, multi-agent collaboration, and optimized mobile deployment.

Authors:Tao Wang, Zhenxuan Zhang, Yuanbo Zhou, Xinlin Zhang, Yuanbin Chen, Tao Tan, Guang Yang, Tong Tong
Title: From Noisy Labels to Intrinsic Structure: A Geometric-Structural Dual-Guided Framework for Noise-Robust Medical Image Segmentation
Abstract:
The effectiveness of convolutional neural networks in medical image segmentation relies on large-scale, high-quality annotations, which are costly and time-consuming to obtain. Even expert-labeled datasets inevitably contain noise arising from subjectivity and coarse delineations, which disrupts feature learning and adversely impacts model performance. To address these challenges, this study proposes a Geometric-Structural Dual-Guided Network (GSD-Net), which integrates geometric and structural cues to improve robustness against noisy annotations. It incorporates a Geometric Distance-Aware module that dynamically adjusts pixel-level weights using geometric features, thereby strengthening supervision in reliable regions while suppressing noise. A Structure-Guided Label Refinement module further refines labels with structural priors, and a Knowledge Transfer module enriches supervision and improves sensitivity to local details. To comprehensively assess its effectiveness, we evaluated GSD-Net on six publicly available datasets: four containing three types of simulated label noise, and two with multi-expert annotations that reflect real-world subjectivity and labeling inconsistencies. Experimental results demonstrate that GSD-Net achieves state-of-the-art performance under noisy annotations, with improvements of 2.52% on Kvasir, 22.76% on Shenzhen, 8.87% on BU-SUC, and 4.59% on BraTS2020 under SR simulated noise. The codes of this study are available at https://github.com/ortonwang/GSD-Net.
Chinese: 本研究提出GSD-Net,一种几何与结构双引导网络,通过动态调整监督和优化标签来增强对医学图像噪声标注的鲁棒性,在多个数据集上实现了最先进的性能。
English: This study introduces GSD-Net, a geometric-structural dual-guided network that enhances robustness against noisy medical image annotations by dynamically adjusting supervision and refining labels, achieving state-of-the-art performance across multiple datasets.
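The distance-aware reweighting idea above can be illustrated with a simple per-pixel weight map derived from the annotation's distance transform: pixels deep inside the labeled structure (more reliable) count more than pixels near the boundary, where delineation noise concentrates. The weighting function below is an illustrative stand-in for the Geometric Distance-Aware module, not the paper's exact formulation.

```python
# Minimal sketch of distance-aware pixel weighting for a noisy segmentation
# mask: pixels far from the annotation boundary get higher weight than
# boundary pixels. The weighting function is an illustrative stand-in.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_aware_weights(mask: np.ndarray, tau: float = 5.0) -> torch.Tensor:
    """mask: (H, W) binary annotation; returns per-pixel weights in (0, 1]."""
    dist_in = distance_transform_edt(mask)          # fg pixels: distance to background
    dist_out = distance_transform_edt(1 - mask)     # bg pixels: distance to foreground
    dist_to_boundary = dist_in + dist_out           # small only near the contour
    w = 1.0 - np.exp(-dist_to_boundary / tau)       # downweight the boundary band
    return torch.from_numpy(w).float().clamp(min=0.1)

mask = np.zeros((128, 128), dtype=np.uint8); mask[40:90, 30:100] = 1
weights = distance_aware_weights(mask)

logits = torch.randn(1, 1, 128, 128, requires_grad=True)
target = torch.from_numpy(mask).float().view(1, 1, 128, 128)
per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
loss = (per_pixel * weights).mean()                 # weighted supervision
loss.backward()
```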

Authors:Xiaobao Wei, Changyong Shu, Zhaokun Yue, Chang Huang, Weiwei Liu, Shuai Yang, Lirong Yang, Peng Gao, Wenbin Zhang, Gaochao Zhu, Chengxiang Wang
Title: Decoupling Bidirectional Geometric Representations of 4D cost volume with 2D convolution
Abstract:
High-performance real-time stereo matching methods invariably rely on 3D regularization of the cost volume, which is unfriendly to mobile devices, while 2D regularization-based methods struggle in ill-posed regions. In this paper, we present DBStereo, a deployment-friendly 4D cost aggregation network based on pure 2D convolutions. Specifically, we first provide a thorough analysis of the decoupling characteristics of the 4D cost volume, and design a lightweight bidirectional geometry aggregation block to capture spatial and disparity representations respectively. Through decoupled learning, our approach achieves real-time performance and impressive accuracy simultaneously. Extensive experiments demonstrate that our proposed DBStereo outperforms all existing aggregation-based methods in both inference time and accuracy, even surpassing the iterative-based method IGEV-Stereo. Our study breaks the empirical reliance on 3D convolutions for 4D cost volumes and provides a simple yet strong baseline for the proposed decoupled aggregation paradigm for further study. Code will be available at https://github.com/happydummy/DBStereo soon.
Chinese: DBStereo提出了一种基于纯二维卷积的易部署4D代价聚合网络,通过解耦空间和视差学习,在实现实时性能的同时获得了卓越精度,超越了现有方法且无需依赖三维正则化。
English: DBStereo introduces a deployment-friendly 4D cost aggregation network using pure 2D convolutions, achieving real-time performance and superior accuracy by decoupling spatial and disparity learning, outperforming existing methods without relying on 3D regularization.
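Code sketch (illustrative): the decoupling idea can be approximated by folding the disparity axis of the 4D cost volume (B, C, D, H, W) into the batch for a spatial 2D convolution, then folding the spatial axes into the batch for a pass along the disparity axis. Layer widths and kernel sizes are placeholder assumptions, not DBStereo's actual block design.

# Illustrative sketch: aggregating a 4D cost volume with only 2D convolutions by treating
# disparity planes, and then spatial locations, as extra batch entries.
import torch
import torch.nn as nn

class DecoupledAggregation(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.disparity = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = cost.shape
        # Spatial pass: each disparity plane becomes an independent 2D sample.
        x = cost.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)
        x = self.spatial(x).reshape(b, d, c, h, w).permute(0, 2, 1, 3, 4)
        # Disparity pass: fold H*W into the batch and convolve along the disparity axis.
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, d, 1)
        y = self.disparity(y).reshape(b, h, w, c, d).permute(0, 3, 4, 1, 2)
        return y

cost = torch.randn(1, 16, 32, 64, 128)   # (B, C, D, H, W)
print(DecoupledAggregation()(cost).shape)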

Authors:Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
Title: MedDINOv3: How to adapt vision foundation models for medical image segmentation?
Abstract:
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
Chinese: MedDINOv3通过领域自适应预训练和架构改进,将DINOv3视觉基础模型成功应用于医学图像分割,在多个基准测试中达到最优性能,有效解决了自然图像与医学图像间的领域差异及ViT模型性能不足的问题。
English: MedDINOv3 adapts the DINOv3 vision foundation model to medical imaging through domain-specific pretraining and architectural enhancements, achieving state-of-the-art performance across multiple segmentation benchmarks while addressing key challenges in transferability and ViT underperformance.
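Code sketch (illustrative): multi-scale token aggregation over a plain ViT can be approximated by collecting patch tokens from several intermediate blocks, projecting them, and fusing them into a dense feature map for a segmentation head. The chosen layers, projections, and averaging rule are assumptions, not the exact MedDINOv3 architecture.

# Hypothetical sketch: fuse patch tokens from several ViT blocks into one feature map.
import torch
import torch.nn as nn

class MultiScaleTokenAggregator(nn.Module):
    def __init__(self, dim: int = 768, num_levels: int = 4):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_levels))

    def forward(self, tokens_per_level, grid_hw):
        h, w = grid_hw
        # Project tokens from each selected block and average them.
        fused = sum(p(t) for p, t in zip(self.proj, tokens_per_level)) / len(self.proj)
        b, n, c = fused.shape                       # n == h * w patch tokens
        return fused.transpose(1, 2).reshape(b, c, h, w)

tokens = [torch.randn(2, 14 * 14, 768) for _ in range(4)]    # tokens from 4 ViT blocks
print(MultiScaleTokenAggregator()(tokens, (14, 14)).shape)   # (2, 768, 14, 14)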

Authors:Zeren Xiong, Zikun Chen, Zedong Zhang, Xiang Li, Ying Tai, Jian Yang, Jun Li
Title: Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion
Abstract:
In this paper, we tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method begins by rendering multi-view images and normal maps from the input 3D model, then generating a novel 2D object using adaptive text-image harmony (ATIH) with the front-view image and a text description from another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. For enhanced shape accuracy, we propose shape multi-view diffusion to improve the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations, such as shark(3D)-crocodile(text) in the first row of Fig. 1. A project page is available at: https://xzr52.github.io/C33D/
中文: 本文提出C33D方法,通过自适应文本图像协调和多视角扩散技术,将现有3D模型与其他类别对象结合生成新颖3D模型,有效保持纹理一致性和形状精确度。
English: This paper introduces C33D, a novel method for synthesizing 3D models by combining existing 3D objects with new categories through adaptive text-image harmony and multi-view diffusion to ensure texture consistency and shape accuracy.

Authors:Yihong Wu, Jinqiao Wei, Xionghui Zhao, Yidi Li, Shaoyi Du, Bin Ren, Nicu Sebe
Title: DSGC-Net: A Dual-Stream Graph Convolutional Network for Crowd Counting via Feature Correlation Mining
Abstract:
Deep learning-based crowd counting methods have achieved remarkable progress in recent years. However, in complex crowd scenarios, existing models still face challenges when adapting to significant density distribution differences between regions. Additionally, the inconsistency of individual representations caused by viewpoint changes and body posture differences further limits the counting accuracy of the models. To address these challenges, we propose DSGC-Net, a Dual-Stream Graph Convolutional Network based on feature correlation mining. DSGC-Net introduces a Density Approximation (DA) branch and a Representation Approximation (RA) branch. By modeling two semantic graphs, it captures the potential feature correlations in density variations and representation distributions. The DA branch incorporates a density prediction module that generates the density distribution map, and constructs a density-driven semantic graph based on density similarity. The RA branch establishes a representation-driven semantic graph by computing global representation similarity. Then, graph convolutional networks are applied to the two semantic graphs separately to model the latent semantic relationships, which enhance the model's ability to adapt to density variations and improve counting accuracy in multi-view and multi-pose scenarios. Extensive experiments on three widely used datasets demonstrate that DSGC-Net outperforms current state-of-the-art methods. In particular, we achieve MAE of 48.9 and 5.9 in ShanghaiTech Part A and Part B datasets, respectively. The released code is available at: https://github.com/Wu-eon/CrowdCounting-DSGCNet.
中文: DSGC-Net通过双流图卷积网络,构建密度和表征语义图来捕捉特征关联,有效提升了复杂人群场景下的计数精度,在多个数据集上实现了最优性能。
English: DSGC-Net, a dual-stream graph convolutional network, addresses crowd counting challenges by modeling density and representation correlations through two semantic graphs, achieving state-of-the-art performance on benchmark datasets.
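Code sketch (illustrative): a density-driven semantic graph can be approximated by measuring pairwise density similarity between feature-map regions, turning it into a row-normalized adjacency, and propagating region features with one graph-convolution step. The similarity measure and normalization below are assumptions rather than DSGC-Net's exact construction.

# Illustrative sketch: one GCN-style propagation step over a density-similarity graph.
import torch

def density_graph_conv(region_feats: torch.Tensor, region_density: torch.Tensor,
                       weight: torch.Tensor) -> torch.Tensor:
    """region_feats: (N, C), region_density: (N,), weight: (C, C_out)."""
    # Regions with similar predicted density get stronger edges.
    diff = (region_density[:, None] - region_density[None, :]).abs()
    adj = torch.softmax(-diff, dim=-1)                  # row-normalized affinity matrix
    return torch.relu(adj @ region_feats @ weight)      # propagate and transform features

feats = torch.randn(64, 128)      # 64 regions with 128-d features
density = torch.rand(64)          # predicted density per region
w = torch.randn(128, 128)
print(density_graph_conv(feats, density, w).shape)      # (64, 128)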

Authors:Nils Hoehing, Mayug Maniparambil, Ellen Rushe, Noel E. O'Connor, Anthony Ventresque
Title: Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks
Abstract:
We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It is comprised of entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience
Chinese: RocketScience 是一个评估视觉语言模型空间关系理解能力的开源基准测试,发现现有模型存在显著缺陷,并证实空间推理能力是主要瓶颈,而非物体定位能力。
English: RocketScience is an open-source benchmark that evaluates spatial relation understanding in vision-language models, revealing significant deficiencies in current models despite high human performance and identifying spatial reasoning as the primary bottleneck.

Authors:Mohit Mendiratta, Mayur Deshmukh, Kartik Teotia, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt
Title: GRMM: Real-Time High-Fidelity Gaussian Morphable Head Model with Learned Residuals
Abstract:
3D Morphable Models (3DMMs) enable controllable facial geometry and expression editing for reconstruction, animation, and AR/VR, but traditional PCA-based mesh models are limited in resolution, detail, and photorealism. Neural volumetric methods improve realism but remain too slow for interactive use. Recent Gaussian Splatting (3DGS) based facial models achieve fast, high-quality rendering but still depend solely on a mesh-based 3DMM prior for expression control, limiting their ability to capture fine-grained geometry, expressions, and full-head coverage. We introduce GRMM, the first full-head Gaussian 3D morphable model that augments a base 3DMM with residual geometry and appearance components, additive refinements that recover high-frequency details such as wrinkles, fine skin texture, and hairline variations. GRMM provides disentangled control through low-dimensional, interpretable parameters (e.g., identity shape, facial expressions) while separately modelling residuals that capture subject- and expression-specific detail beyond the base model's capacity. Coarse decoders produce vertex-level mesh deformations, fine decoders represent per-Gaussian appearance, and a lightweight CNN refines rasterised images for enhanced realism, all while maintaining 75 FPS real-time rendering. To learn consistent, high-fidelity residuals, we present EXPRESS-50, the first dataset with 60 aligned expressions across 50 identities, enabling robust disentanglement of identity and expression in Gaussian-based 3DMMs. Across monocular 3D face reconstruction, novel-view synthesis, and expression transfer, GRMM surpasses state-of-the-art methods in fidelity and expression accuracy while delivering interactive real-time performance.
中文: GRMM是首个全头高斯3D可形变模型,通过增强基础3DMM的残差组件来捕捉皱纹和毛发等高频细节,以75 FPS实现实时渲染,并在面部重建和表情任务中表现卓越。
English: GRMM is the first full-head Gaussian 3D morphable model that enhances a base 3DMM with residual components to capture high-frequency details like wrinkles and hair, providing real-time rendering at 75 FPS and superior performance in facial reconstruction and expression tasks.

Authors:Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
Title: SALAD -- Semantics-Aware Logical Anomaly Detection
Abstract:
Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. Code: https://github.com/MaticFuc/SALAD
Chinese Summary: 提出的SALAD方法通过新的组合分支显式建模物体组合图分布,无需人工标注即可在MVTec LOCO基准上实现96.1%的图像级AUROC,显著提升了逻辑异常检测性能。
English Summary: The proposed SALAD method enhances logical anomaly detection by explicitly modeling object composition maps with a new composition branch, achieving a 96.1% AUROC on MVTec LOCO without requiring manual labels.

Authors:Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou
Title: Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
Abstract:
In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models are available at https://github.com/showlab/DIM.
中文: 该研究提出了Draw-In-Mind (DIM)数据集和模型,通过强化理解模块的设计职责来解决多模态模型中职责分配失衡问题,以较少参数量实现了图像编辑任务的顶尖性能。
English: The study introduces Draw-In-Mind (DIM), a dataset and model that addresses imbalanced responsibilities in unified multimodal models by enhancing the understanding module's design role, achieving state-of-the-art image editing performance with fewer parameters.

Authors:Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang
Title: RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Abstract:
Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.
中文摘要:RSCC数据集通过提供62,315组灾前/灾后图像对及拟人化变化描述,弥补了遥感数据中时序图像对与详细标注的缺失,为灾害感知的双时相视觉语言模型提供了可靠的训练与评估基础。
English Summary: The RSCC dataset addresses the lack of temporal image pairs and detailed annotations in remote sensing by providing 62,315 pre-/post-disaster image pairs with human-like change captions, enabling robust training of vision-language models for disaster analysis.

Authors:Zhipeng Weng, Xiaopeng Liu, Ce Liu, Xingyuan Guo, Yukai Shi, Liang Lin
Title: DroneSR: Rethinking Few-shot Thermal Image Super-Resolution from Drone-based Perspective
Abstract:
Although large-scale models achieve significant improvements in performance, overfitting still frequently undermines their generalization ability. In image super-resolution tasks, diffusion models, as representatives of generative models, typically adopt large-scale architectures. However, few-shot drone-captured infrared training data frequently induces severe overfitting in such architectures. To address this key challenge, we propose a new Gaussian quantization representation learning method oriented to diffusion models that alleviates overfitting and enhances robustness. At the same time, an effective monitoring mechanism tracks large-scale architectures during training to detect signs of overfitting. By introducing Gaussian quantization representation learning, our method effectively reduces overfitting while maintaining architecture complexity. On this basis, we construct a multi-source drone-based infrared image benchmark dataset for detection and use it to highlight the overfitting issues of large-scale architectures in few-sample, diverse drone-based image reconstruction scenarios. To verify the efficacy of the method in mitigating overfitting, experiments are conducted on the constructed benchmark. Experimental results demonstrate that our method outperforms existing super-resolution approaches and significantly mitigates overfitting of large-scale architectures under complex conditions. The code and DroneSR dataset will be available at: https://github.com/wengzp1/GARLSR.
中文:本文针对无人机拍摄的少量红外图像导致大规模架构过拟合的问题,提出了一种面向扩散模型的高斯量化表示学习方法,通过构建的新基准数据集验证了该方法在抑制过拟合方面优于现有超分辨率技术。
English: This paper introduces a Gaussian quantization representation learning method for diffusion models to mitigate overfitting in large-scale architectures when processing few-shot drone-captured infrared images, validated through a newly constructed benchmark dataset showing superior performance over existing super-resolution approaches.

Authors:Yotam Erel, Rishabh Dabral, Vladislav Golyanik, Amit H. Bermano, Christian Theobalt
Title: PractiLight: Practical Light Control Using Foundational Diffusion Models
Abstract:
Light control in generated images is a difficult task, posing specific challenges that span the entire image and frequency spectrum. Most approaches tackle this problem by training on extensive yet domain-specific datasets, limiting the inherent generalization and applicability of the foundational backbones used. Instead, PractiLight is a practical approach, effectively leveraging foundational understanding of recent generative models for the task. Our key insight is that lighting relationships in an image are similar in nature to token interaction in self-attention layers, and hence are best represented there. Based on this and other analyses regarding the importance of early diffusion iterations, PractiLight trains a lightweight LoRA regressor to produce the direct irradiance map for a given image, using a small set of training images. We then employ this regressor to incorporate the desired lighting into the generation process of another image using Classifier Guidance. This careful design generalizes well to diverse conditions and image domains. We demonstrate state-of-the-art performance in terms of quality and control with proven parameter and data efficiency compared to leading works over a wide variety of scene types. We hope this work affirms that image lighting can feasibly be controlled by tapping into foundational knowledge, enabling practical and general relighting.
Chinese Summary: PractiLight提出了一种实用的图像光照控制方法,通过利用生成模型的基础知识,采用轻量级LoRA回归器生成辐照度图,并应用分类器引导技术,实现在多种场景下高效且通用的重光照效果。
English Summary: PractiLight introduces a practical method for controlling image lighting by leveraging generative models' foundational knowledge, using a lightweight LoRA regressor to produce irradiance maps and applying Classifier Guidance for effective and generalized relighting across diverse scenes.

Authors:Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, Ranjay Krishna
Title: Reinforced Visual Perception with Tools
Abstract:
Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.
Chinese Summary: 本文提出ReVPT方法,通过强化学习训练多模态大语言模型使用四种视觉工具来增强视觉推理能力,在多个感知密集型基准测试中取得最优性能,显著超越监督学习和基于文本的强化学习基线。
English Summary: The paper introduces ReVPT, a reinforcement learning approach that enhances multimodal LLMs' visual reasoning by training them to use four visual tools, achieving state-of-the-art results on perception-heavy benchmarks and outperforming supervised baselines by significant margins.

Authors:Yuqing Chen, Junjie Wang, Lin Liu, Ruihang Chu, Xiaopeng Zhang, Qi Tian, Yujiu Yang
Title: O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing
Abstract:
Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signals for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a "copy-form" preservation module for preserving non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks. https://cyqii.github.io/O-DisCo-Edit.github.io/
中文:O-DisCo-Edit提出了一种统一框架,结合物体扭曲控制和复制形态保护模块,能在多种视频编辑任务中实现灵活且高保真的编辑效果,在实验和人工评估中均优于现有方法。
English: O-DisCo-Edit introduces a unified framework with object distortion control and a copy-form preservation module to enable flexible, high-fidelity video editing across diverse tasks, outperforming existing methods in experiments and human evaluations.

Authors:Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
Title: ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
Abstract:
We present ViSTA-SLAM as a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design significantly reduces model complexity (our frontend is only 35\% the size of comparable state-of-the-art methods) while enhancing the quality of the two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. Github repository: https://github.com/zhangganlin/vista-slam
中文:ViSTA-SLAM是一种无需相机内参的实时单目视觉SLAM系统,采用轻量级对称双视图关联模型进行位姿估计和点云重建,显著降低复杂度,在相机跟踪和稠密三维重建方面性能优越。
English: ViSTA-SLAM is a real-time monocular visual SLAM system that operates without camera intrinsics, using a lightweight symmetric two-view association model for efficient pose estimation and point mapping, achieving superior tracking and 3D reconstruction with reduced complexity.

Authors:Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
Title: Kwai Keye-VL 1.5 Technical Report
Abstract:
In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
中文:Keye-VL-1.5通过慢快双路视频编码、渐进式上下文扩展和强化后训练三大创新,在显著提升视频理解能力的同时保持了优异的多模态性能。
English: Keye-VL-1.5 introduces a Slow-Fast video encoding strategy, progressive context extension, and enhanced post-training to significantly improve video understanding while maintaining strong multimodal performance.
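Code sketch (illustrative): the Slow-Fast routing idea can be approximated by sending frames whose content changes substantially relative to the last kept key frame to a high-resolution (slow) path and near-static frames to a low-resolution (fast) path. The cosine-similarity metric and threshold are assumptions, not the report's actual criterion.

# Hypothetical sketch: split video frames between slow (high-res) and fast (low-res) pathways
# based on similarity to the most recent slow-path frame.
import torch
import torch.nn.functional as F

def route_frames(frames: torch.Tensor, threshold: float = 0.9):
    """frames: (T, C, H, W) in [0, 1]. Returns index lists for the slow and fast pathways."""
    slow, fast = [0], []                          # always keep the first frame at high resolution
    for t in range(1, frames.shape[0]):
        prev = frames[slow[-1]].flatten()
        cur = frames[t].flatten()
        sim = F.cosine_similarity(prev, cur, dim=0)
        (fast if sim > threshold else slow).append(t)
    return slow, fast

video = torch.rand(16, 3, 224, 224)
slow_idx, fast_idx = route_frames(video)
print(len(slow_idx), len(fast_idx))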

Authors:Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen
Title: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Abstract:
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (\textit{i.e.}, \textbf{V$^2$Drop}), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V$^2$Drop is able to maintain \textbf{94.0\%} and \textbf{98.6\%} of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by \textbf{31.5\%} and \textbf{74.2\%}. When combined with efficient operators, V$^2$Drop further reduces GPU peak memory usage.
中文: 本文提出V²Drop方法,通过渐进式剔除变化最小的视觉标记来实现大视觉语言模型的高效计算,在保持图像和视频理解任务性能的同时显著降低延迟和内存消耗。
English: This paper introduces V²Drop, a token compression method that enhances computational efficiency in large vision-language models by progressively dropping visual tokens with minimal variation, maintaining high performance while significantly reducing latency and memory usage.
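Code sketch (illustrative): variation-aware dropping can be approximated by measuring how much each visual token's hidden state changes between consecutive LLM layers and keeping only the most-varying tokens. The norm-based variation measure and keep ratio are assumptions rather than the exact V²Drop criterion.

# Illustrative sketch: keep the top-k visual tokens by inter-layer hidden-state change.
import torch

def drop_low_variation_tokens(prev_hidden: torch.Tensor, cur_hidden: torch.Tensor,
                              keep_ratio: float = 0.5) -> torch.Tensor:
    """prev_hidden, cur_hidden: (B, N, C) visual-token states from consecutive layers."""
    variation = (cur_hidden - prev_hidden).norm(dim=-1)         # (B, N) per-token change
    k = max(1, int(keep_ratio * variation.shape[1]))
    keep = variation.topk(k, dim=1).indices.sort(dim=1).values  # preserve original token order
    return torch.gather(cur_hidden, 1, keep.unsqueeze(-1).expand(-1, -1, cur_hidden.shape[-1]))

prev, cur = torch.randn(2, 196, 1024), torch.randn(2, 196, 1024)
print(drop_low_variation_tokens(prev, cur).shape)               # (2, 98, 1024)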

Authors:Liu Qifeng, Zhao Dawei, Dong Yabo, Xiao Liang, Wang Juan, Min Chen, Li Fuyang, Jiang Weizhong, Lu Dongming, Nie Yiming
Title: PointSlice: Accurate and Efficient Slice-Based Representation for 3D Object Detection from Point Clouds
Abstract:
3D object detection from point clouds plays a critical role in autonomous driving. Currently, the primary methods for point cloud processing are voxel-based and pillar-based approaches. Voxel-based methods offer high accuracy through fine-grained spatial segmentation but suffer from slower inference speeds. Pillar-based methods enhance inference speed but still fall short of voxel-based methods in accuracy. To address these issues, we propose a novel point cloud processing method, PointSlice, which slices point clouds along the horizontal plane and includes a dedicated detection network. The main contributions of PointSlice are: (1) A new point cloud processing technique that converts 3D point clouds into multiple sets of 2D (x-y) data slices. The model only learns 2D data distributions, treating the 3D point cloud as separate batches of 2D data, which reduces the number of model parameters and enhances inference speed; (2) The introduction of a Slice Interaction Network (SIN). To maintain vertical relationships across slices, we incorporate SIN into the 2D backbone network, which improves the model's 3D object perception capability. Extensive experiments demonstrate that PointSlice achieves high detection accuracy and inference speed. On the Waymo dataset, PointSlice is 1.13x faster and has 0.79x fewer parameters than the state-of-the-art voxel-based method (SAFDNet), with only a 1.2 mAPH accuracy reduction. On the nuScenes dataset, we achieve a state-of-the-art detection result of 66.74 mAP. On the Argoverse 2 dataset, PointSlice is 1.10x faster, with 0.66x fewer parameters and a 1.0 mAP accuracy reduction. The code will be available at https://github.com/qifeng22/PointSlice2.
中文: PointSlice提出了一种创新的点云处理方法,通过将三维点云转换为二维切片并结合切片交互网络,在保持高精度的同时显著提升了推理速度,在多个自动驾驶数据集中表现优异。
English: PointSlice introduces a novel 3D object detection method that converts point clouds into 2D slices and employs a Slice Interaction Network to balance high accuracy with improved inference speed across multiple autonomous driving datasets.
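Code sketch (illustrative): the slicing idea can be approximated by binning points along z into horizontal bands and rasterizing each band into a 2D (x-y) occupancy grid, so a 2D backbone can process the slices as a batch. Grid resolution and coordinate ranges are placeholder assumptions, not PointSlice's configuration.

# Hypothetical sketch: convert a point cloud into a batch of 2D (x-y) slice images.
import torch

def slice_points(points: torch.Tensor, num_slices: int = 8, grid: int = 64,
                 xy_range: float = 50.0, z_range=(-3.0, 3.0)) -> torch.Tensor:
    """points: (N, 3) xyz. Returns an occupancy tensor of shape (num_slices, 1, grid, grid)."""
    x = ((points[:, 0] + xy_range) / (2 * xy_range) * grid).long().clamp(0, grid - 1)
    y = ((points[:, 1] + xy_range) / (2 * xy_range) * grid).long().clamp(0, grid - 1)
    z = ((points[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * num_slices).long()
    z = z.clamp(0, num_slices - 1)
    out = torch.zeros(num_slices, 1, grid, grid)
    out[z, 0, y, x] = 1.0                          # mark occupied cells in the matching slice
    return out

pts = torch.rand(10000, 3) * torch.tensor([100.0, 100.0, 6.0]) - torch.tensor([50.0, 50.0, 3.0])
print(slice_points(pts).shape)                     # (8, 1, 64, 64)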

Authors:Artur Díaz-Juan, Coloma Ballester, Gloria Haro
Title: SoccerHigh: A Benchmark Dataset for Automatic Soccer Video Summarization
Abstract:
Video summarization aims to extract key shots from longer videos to produce concise and informative summaries. One of its most common applications is in sports, where highlight reels capture the most important moments of a game, along with notable reactions and specific contextual events. Automatic summary generation can support video editors in the sports media industry by reducing the time and effort required to identify key segments. However, the lack of publicly available datasets poses a challenge in developing robust models for sports highlight generation. In this paper, we address this gap by introducing a curated dataset for soccer video summarization, designed to serve as a benchmark for the task. The dataset includes shot boundaries for 237 matches from the Spanish, French, and Italian leagues, using broadcast footage sourced from the SoccerNet dataset. Alongside the dataset, we propose a baseline model specifically designed for this task, which achieves an F1 score of 0.3956 in the test set. Furthermore, we propose a new metric constrained by the length of each target summary, enabling a more objective evaluation of the generated content. The dataset and code are available at https://ipcv.github.io/SoccerHigh/.

Authors:Mo Wang, Kaining Peng, Jingsheng Tang, Hongkai Wen, Quanying Liu
Title: DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases
Abstract:
Brain atlases are essential for reducing the dimensionality of neuroimaging data and enabling interpretable analysis. However, most existing atlases are predefined, group-level templates with limited flexibility and resolution. We present Deep Cluster Atlas (DCA), a graph-guided deep embedding clustering framework for generating individualized, voxel-wise brain parcellations. DCA combines a pretrained autoencoder with spatially regularized deep clustering to produce functionally coherent and spatially contiguous regions. Our method supports flexible control over resolution and anatomical scope, and generalizes to arbitrary brain structures. We further introduce a standardized benchmarking platform for atlas evaluation, using multiple large-scale fMRI datasets. Across multiple datasets and scales, DCA outperforms state-of-the-art atlases, improving functional homogeneity by 98.8% and silhouette coefficient by 29%, and achieves superior performance in downstream tasks such as autism diagnosis and cognitive decoding. We also observe that a fine-tuned pretrained model achieves superior results on the corresponding task. Codes and models are available at https://github.com/ncclab-sustech/DCA .
中文: DCA是一种通过深度聚类生成个性化高分辨率脑区图谱的新框架,在功能同质性和疾病诊断等下游任务中显著优于现有方法。
English: DCA is a novel brain atlas framework that generates individualized, high-resolution parcellations using deep clustering, significantly outperforming existing methods in functional coherence and diagnostic applications.

Authors:Wei Lu, Lingyu Zhu, Si-Bao Chen
Title: Unsupervised Ultra-High-Resolution UAV Low-Light Image Enhancement: A Benchmark, Metric and Framework
Abstract:
Low-light conditions significantly degrade the performance of Unmanned Aerial Vehicles (UAVs) in critical applications. Existing Low-light Image Enhancement (LIE) methods struggle with the unique challenges of aerial imagery, including Ultra-High Resolution (UHR), lack of paired data, severe non-uniform illumination, and deployment constraints. To address these issues, we propose three key contributions. First, we present U3D, the first unsupervised UHR UAV dataset for LIE, with a unified evaluation toolkit. Second, we introduce the Edge Efficiency Index (EEI), a novel metric balancing perceptual quality with key deployment factors: speed, resolution, model complexity, and memory footprint. Third, we develop U3LIE, an efficient framework with two training-only designs: Adaptive Pre-enhancement Augmentation (APA) for input normalization and a Luminance Interval Loss (L_int) for exposure control. U3LIE achieves SOTA results, processing 4K images at 23.8 FPS on a single GPU, making it ideal for real-time on-board deployment. In summary, these contributions provide a holistic solution (dataset, metric, and method) for advancing robust 24/7 UAV vision. The code and datasets are available at https://github.com/lwCVer/U3D_Toolkit.
中文摘要:该研究提出U3D无监督超高清数据集、边缘效率指数指标及U3LIE框架,有效提升无人机低光图像性能,实现4K实时处理,为全天候无人机视觉提供完整解决方案。
English Summary: The study introduces U3D, an unsupervised ultra-high-resolution dataset, the Edge Efficiency Index metric, and the U3LIE framework to enhance low-light UAV image performance, achieving real-time 4K processing for robust 24/7 aerial vision.
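Code sketch (illustrative): a luminance interval loss can be approximated as a hinge-style penalty that is zero when the enhanced image's mean luminance lies inside a target interval and grows linearly outside it. The interval bounds and luminance formula are assumptions; the actual L_int formulation may differ.

# Hypothetical sketch: penalize per-image mean luminance that leaves a target interval.
import torch

def luminance_interval_loss(enhanced: torch.Tensor, low: float = 0.4, high: float = 0.7) -> torch.Tensor:
    """enhanced: (B, 3, H, W) in [0, 1]."""
    luma = 0.299 * enhanced[:, 0] + 0.587 * enhanced[:, 1] + 0.114 * enhanced[:, 2]
    mean_luma = luma.mean(dim=(1, 2))                                  # per-image mean luminance
    return (torch.relu(low - mean_luma) + torch.relu(mean_luma - high)).mean()

img = torch.rand(4, 3, 256, 256)
print(luminance_interval_loss(img))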

Authors:Jiayi Gao, Changcheng Hua, Qingchao Chen, Yuxin Peng, Yang Liu
Title: Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement
Abstract:
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image and design sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal cost. Specifically, we first propose Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. The above mutual refinement significantly improves input quality before video generation. Finally, we propose ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during generation. Our method outperforms prior work and is validated by automatic and human evaluations on a 1000 video test set, winning first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, demonstrating state-of-the-art performance and strong generality. The code is available at https://github.com/Andyplus1/IPT2V.git.
中文: TPIGE框架通过GPT-4o优化文本提示和图像生成器增强参考图像,在视频生成过程中采用时空引导技术,无需额外训练即可实现最先进的身份保持文本到视频生成效果。
English: The TPIGE framework enhances identity-preserving text-to-video generation by refining prompts and reference images with GPT-4o and an image generator, then applying spatiotemporal guidance during sampling to achieve state-of-the-art results without additional training.

Authors:Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickaël Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet
Title: Image Quality Enhancement and Detection of Small and Dense Objects in Industrial Recycling Processes
Abstract:
This paper tackles two key challenges: detecting small, dense, and overlapping objects (a major hurdle in computer vision) and improving the quality of noisy images, especially those encountered in industrial environments [1, 2]. Our focus is on evaluating methods built on supervised deep learning. We perform an analysis of these methods, using a newly developed dataset comprising over 10k images and 120k instances. By evaluating their performance, accuracy, and computational efficiency, we identify the most reliable detection systems and highlight the specific challenges they address in industrial applications. This paper also examines the use of deep learning models to improve image quality in noisy industrial environments. We introduce a lightweight model based on a fully connected convolutional network. Additionally, we suggest potential future directions for further enhancing the effectiveness of the model. The repository of the dataset and proposed model can be found at: https://github.com/o-messai/SDOOD, https://github.com/o-messai/DDSRNet
Chinese: 本文针对工业环境中检测小而密集、重叠物体及提升噪声图像质量的问题,采用监督式深度学习方法,通过新数据集评估性能并提出轻量级模型以改善图像质量。
English: This paper addresses the detection of small, dense, and overlapping objects and the enhancement of noisy images in industrial settings using supervised deep learning, evaluating methods on a new dataset and proposing a lightweight model for improved image quality.

Authors:Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Lam-Huy Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
Abstract:
Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
Chinese: ReCap提出了一种事件增强的图像描述生成流程,通过整合文章上下文信息来生成富含叙述性的描述,在EVENTA 2025挑战赛中排名第二,有效解决了传统模型缺乏语境感知的问题。
English: ReCap introduces an event-enriched image captioning pipeline that integrates contextual information from articles to generate narrative-rich captions, addressing the limitations of standard models by ranking 2nd in the EVENTA 2025 challenge.
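Code sketch (illustrative): the patch-level re-ranking stage can be approximated by counting mutual nearest-neighbor matches between the query image's patch features and each candidate article image's patch features, then re-ranking candidates by that count. Feature shapes and the scoring rule are assumptions, not ReCap's exact procedure.

# Hypothetical sketch: mutual nearest-neighbor score between two sets of patch features.
import torch

def mnn_score(query_patches: torch.Tensor, cand_patches: torch.Tensor) -> int:
    """query_patches: (Nq, D), cand_patches: (Nc, D); both L2-normalized."""
    sim = query_patches @ cand_patches.t()            # (Nq, Nc) cosine similarities
    q_to_c = sim.argmax(dim=1)                         # best candidate patch for each query patch
    c_to_q = sim.argmax(dim=0)                         # best query patch for each candidate patch
    mutual = c_to_q[q_to_c] == torch.arange(sim.shape[0])
    return int(mutual.sum())                           # number of mutual nearest-neighbor pairs

q = torch.nn.functional.normalize(torch.randn(256, 768), dim=-1)
c = torch.nn.functional.normalize(torch.randn(256, 768), dim=-1)
print(mnn_score(q, c))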

Authors:Xiangdong Zhang, Shaofeng Zhang, Junchi Yan
Title: Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
Abstract:
Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. Recognizing that a two-view pre-training paradigm inherently introduces greater diversity and variance, it may thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% in three variants of ScanObjectNN with the Mlp-Linear evaluation protocol. The code is available at https://github.com/aHapBean/Point-PQAE.
中文摘要:本文提出Point-PQAE双视角交叉重建方法,通过解耦点云生成双视角并设计新型位置编码,在自监督点云学习中实现了比单视角方法更具挑战性的预训练任务,显著提升了性能表现。
English Summary: This paper introduces Point-PQAE, a two-view cross-reconstruction method for self-supervised point cloud learning that surpasses single-view approaches by generating more challenging pre-training tasks through decoupled views and novel positional encoding.
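Code sketch (illustrative): the decoupled two-view generation can be approximated by cropping points around two random centers and recording the offset between the centers, which a relative positional encoding could consume. The crop radius and sampling are assumptions rather than Point-PQAE's exact crop mechanism.

# Hypothetical sketch: generate two decoupled point-cloud views by radius cropping.
import torch

def two_crops(points: torch.Tensor, radius: float = 0.6):
    """points: (N, 3) roughly normalized to the unit sphere. Returns (view1, view2, relative_offset)."""
    centers = points[torch.randint(points.shape[0], (2,))]    # two random crop centers
    views = []
    for c in centers:
        mask = (points - c).norm(dim=-1) < radius
        views.append(points[mask])
    return views[0], views[1], centers[1] - centers[0]

pts = torch.nn.functional.normalize(torch.randn(2048, 3), dim=-1) * torch.rand(2048, 1)
v1, v2, offset = two_crops(pts)
print(v1.shape, v2.shape, offset)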

Authors:Lee Chae-Yeon, Nam Hyeon-Woo, Tae-Hyun Oh
Title: Learning Correlation-aware Aleatoric Uncertainty for 3D Hand Pose Estimation
Abstract:
3D hand pose estimation is a fundamental task in understanding human hands. However, accurately estimating 3D hand poses remains challenging due to the complex movement of hands, self-similarity, and frequent occlusions. In this work, we address two limitations: the inability of existing 3D hand pose estimation methods to estimate aleatoric (data) uncertainty, and the lack of uncertainty modeling that incorporates joint correlation knowledge, which has not been thoroughly investigated. To this end, we introduce aleatoric uncertainty modeling into the 3D hand pose estimation framework, aiming to achieve a better trade-off between modeling joint correlations and computational efficiency. We propose a novel parameterization that leverages a single linear layer to capture intrinsic correlations among hand joints. This is enabled by formulating the hand joint output space as a probabilistic distribution, allowing the linear layer to capture joint correlations. Our proposed parameterization is used as a task head layer, and can be applied as an add-on module on top of the existing models. Our experiments demonstrate that our parameterization for uncertainty modeling outperforms existing approaches. Furthermore, the 3D hand pose estimation model equipped with our uncertainty head achieves favorable accuracy in 3D hand pose estimation while introducing new uncertainty modeling capability to the model. The project page is available at https://hand-uncertainty.github.io/.

Authors:Lingzhou Mu, Qiang Wang, Fan Jiang, Mengchao Wang, Yaqi Fan, Mu Xu, Kai Zhang
Title: FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework
Abstract:
Human-Scene Interaction (HSI) seeks to generate realistic human behaviors within complex environments, yet it faces significant challenges in handling long-horizon, high-level tasks and generalizing to unseen scenes. To address these limitations, we introduce FantasyHSI, a novel HSI framework centered on video generation and multi-agent systems that operates without paired data. We model the complex interaction process as a dynamic directed graph, upon which we build a collaborative multi-agent system. This system comprises a scene navigator agent for environmental perception and high-level path planning, and a planning agent that decomposes long-horizon goals into atomic actions. Critically, we introduce a critic agent that establishes a closed-loop feedback mechanism by evaluating the deviation between generated actions and the planned path. This allows for the dynamic correction of trajectory drifts caused by the stochasticity of the generative model, thereby ensuring long-term logical consistency. To enhance the physical realism of the generated motions, we leverage Direct Preference Optimization (DPO) to train the action generator, significantly reducing artifacts such as limb distortion and foot-sliding. Extensive experiments on our custom SceneBench benchmark demonstrate that FantasyHSI significantly outperforms existing methods in terms of generalization, long-horizon task completion, and physical realism. Our project page: https://fantasy-amap.github.io/fantasy-hsi/

Authors:Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou
Title: POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Abstract:
High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
中文: 本文提出了一种全自动的两阶段框架,首先生成多样化的合成数据训练文档提取模型,然后通过自改进技术在真实文档上迭代优化,最终开发出性能卓越的POINTS-Reader模型,其表现优于许多现有解决方案。
English: This paper introduces a fully automated, two-stage framework that first generates diverse synthetic data to train a model for document extraction and then iteratively refines it using self-improvement techniques on real documents, resulting in the high-performing POINTS-Reader model that surpasses many existing solutions.

Authors:Tianwei Ye, Yong Ma, Xiaoguang Mei
Title: DcMatch: Unsupervised Multi-Shape Matching with Dual-Level Consistency
Abstract:
Establishing point-to-point correspondences across multiple 3D shapes is a fundamental problem in computer vision and graphics. In this paper, we introduce DcMatch, a novel unsupervised learning framework for non-rigid multi-shape matching. Unlike existing methods that learn a canonical embedding from a single shape, our approach leverages a shape graph attention network to capture the underlying manifold structure of the entire shape collection. This enables the construction of a more expressive and robust shared latent space, leading to more consistent shape-to-universe correspondences via a universe predictor. Simultaneously, we represent these correspondences in both the spatial and spectral domains and enforce their alignment in the shared universe space through a novel cycle consistency loss. This dual-level consistency fosters more accurate and coherent mappings. Extensive experiments on several challenging benchmarks demonstrate that our method consistently outperforms previous state-of-the-art approaches across diverse multi-shape matching scenarios. Code is available at https://github.com/YeTianwei/DcMatch.
中文: DcMatch是一种无监督学习框架,通过形状图注意力网络和双域循环一致性实现鲁棒的多形状匹配,在多个基准测试中超越了现有最优方法。
English: DcMatch is an unsupervised learning framework that utilizes a shape graph attention network and dual-domain cycle consistency to achieve robust and consistent multi-shape matching, outperforming previous state-of-the-art methods.

Authors:Bingnan Yang, Mi Zhang, Zhili Zhang, Zhan Zhang, Yuanxin Zhao, Xiangyun Hu, Jianya Gong
Title: SegAssess: Panoramic quality mapping for robust and transferable unsupervised segmentation assessment
Abstract:
High-quality image segmentation is fundamental to pixel-level geospatial analysis in remote sensing, necessitating robust segmentation quality assessment (SQA), particularly in unsupervised settings lacking ground truth. Although recent deep learning (DL) based unsupervised SQA methods show potential, they often suffer from coarse evaluation granularity, incomplete assessments, and poor transferability. To overcome these limitations, this paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive, pixel-wise SQA, and presents SegAssess, a novel deep learning framework realizing this approach. SegAssess distinctively formulates SQA as a fine-grained, four-class panoramic segmentation task, classifying pixels within a segmentation mask under evaluation into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) categories, thereby generating a complete quality map. Leveraging an enhanced Segment Anything Model (SAM) architecture, SegAssess uniquely employs the input mask as a prompt for effective feature integration via cross-attention. Key innovations include an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module to refine predictions near challenging object edges, and an Augmented Mixup Sampling (AMS) training strategy integrating multi-source masks to significantly boost cross-domain robustness and zero-shot transferability. Comprehensive experiments across 32 datasets derived from 6 sources demonstrate that SegAssess achieves state-of-the-art (SOTA) performance and exhibits remarkable zero-shot transferability to unseen masks, establishing PQM via SegAssess as a robust and transferable solution for unsupervised SQA. The code is available at https://github.com/Yangbn97/SegAssess.
中文: 本文提出SegAssess框架,通过全景质量映射实现像素级分割质量评估,在跨域测试中展现出卓越的零样本迁移能力,成为无监督分割质量评估的突破性解决方案。
English: This paper introduces SegAssess, a novel deep learning framework implementing Panoramic Quality Mapping (PQM) for comprehensive pixel-level segmentation quality assessment, which achieves state-of-the-art performance and exceptional zero-shot transferability across diverse datasets.
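Code sketch (illustrative): the four-class panoramic quality map can be derived, for training or analysis, by comparing the mask under evaluation with a reference mask and labeling each pixel TP, FP, TN, or FN. The class indices below are assumptions.

# Illustrative sketch: build a per-pixel TP/FP/TN/FN quality map from two binary masks.
import torch

def quality_map(pred_mask: torch.Tensor, ref_mask: torch.Tensor) -> torch.Tensor:
    """pred_mask, ref_mask: (H, W) binary tensors. Returns (H, W) with 0=TN, 1=FP, 2=FN, 3=TP."""
    pred, ref = pred_mask.bool(), ref_mask.bool()
    out = torch.zeros_like(pred_mask, dtype=torch.long)
    out[pred & ~ref] = 1      # false positive: predicted foreground, reference background
    out[~pred & ref] = 2      # false negative: missed foreground
    out[pred & ref] = 3       # true positive
    return out                # remaining zeros are true negatives

pred = (torch.rand(128, 128) > 0.5).long()
ref = (torch.rand(128, 128) > 0.5).long()
print(torch.bincount(quality_map(pred, ref).flatten(), minlength=4))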

Authors:Weiren Zhao, Lanfeng Zhong, Xin Liao, Wenjun Liao, Sichuan Zhang, Shaoting Zhang, Guotai Wang
Title: MetaSSL: A General Heterogeneous Loss for Semi-Supervised Medical Image Segmentation
Abstract:
Semi-Supervised Learning (SSL) is important for reducing the annotation cost for medical image segmentation models. State-of-the-art SSL methods such as Mean Teacher, FixMatch and Cross Pseudo Supervision (CPS) are mainly based on consistency regularization or pseudo-label supervision between a reference prediction and a supervised prediction. Despite the effectiveness, they have overlooked the potential noise in the labeled data, and mainly focus on strategies to generate the reference prediction, while ignoring the heterogeneous values of different unlabeled pixels. We argue that effectively mining the rich information contained by the two predictions in the loss function, instead of the specific strategy to obtain a reference prediction, is more essential for SSL, and propose a universal framework MetaSSL based on a spatially heterogeneous loss that assigns different weights to pixels by simultaneously leveraging the uncertainty and consistency information between the reference and supervised predictions. Specifically, we split the predictions on unlabeled data into four regions with decreasing weights in the loss: Unanimous and Confident (UC), Unanimous and Suspicious (US), Discrepant and Confident (DC), and Discrepant and Suspicious (DS), where an adaptive threshold is proposed to distinguish confident predictions from suspicious ones. The heterogeneous loss is also applied to labeled images for robust learning considering the potential annotation noise. Our method is plug-and-play and general to most existing SSL methods. The experimental results showed that it improved the segmentation performance significantly when integrated with existing SSL frameworks on different datasets. Code is available at https://github.com/HiLab-git/MetaSSL.
中文: 提出的MetaSSL框架通过引入空间异质性损失,根据不确定性和一致性为像素分配权重,有效处理标注数据中的噪声,显著提升了多种半监督学习方法在医学图像分割中的性能。
English: The proposed MetaSSL framework enhances semi-supervised medical image segmentation by introducing a spatially heterogeneous loss that assigns pixel-level weights based on uncertainty and consistency, effectively addressing noise in labeled data and improving performance across various SSL methods.
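Illustrative note: the UC/US/DC/DS split can be sketched as a per-pixel weighting computed from two predictions. The code below is a minimal PyTorch-style sketch; the fixed threshold `tau` and the weight values are placeholders, whereas MetaSSL uses an adaptive threshold and learned behavior.

```python
import torch

def heterogeneous_weights(p_ref, p_sup, tau=0.8, w=(1.0, 0.7, 0.4, 0.1)):
    """p_ref, p_sup: (B, C, H, W) softmax probabilities from the reference and
    supervised branches. Returns per-pixel loss weights following the
    UC / US / DC / DS split (threshold and weights here are illustrative)."""
    conf_ref, lab_ref = p_ref.max(dim=1)
    conf_sup, lab_sup = p_sup.max(dim=1)
    unanimous = lab_ref == lab_sup                  # the two branches agree
    confident = (conf_ref > tau) & (conf_sup > tau) # both are confident
    weights = torch.empty_like(conf_ref)
    weights[unanimous & confident] = w[0]    # Unanimous and Confident (UC)
    weights[unanimous & ~confident] = w[1]   # Unanimous and Suspicious (US)
    weights[~unanimous & confident] = w[2]   # Discrepant and Confident (DC)
    weights[~unanimous & ~confident] = w[3]  # Discrepant and Suspicious (DS)
    return weights
```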

Authors:Zhengqiang Zhang, Rongyuan Wu, Lingchen Sun, Lei Zhang
Title: GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation
Abstract:
Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose GPSToken, a novel Gaussian Parameterized Spatially-adaptive Tokenization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at https://github.com/xtudbxk/GPSToken.
Chinese: GPSToken提出了一种高斯参数化的空间自适应标记框架,利用二维高斯动态建模图像区域,实现非均匀标记,并在图像重建和生成任务中取得了最先进的性能。
English: GPSToken introduces a Gaussian parameterized spatially-adaptive tokenization framework that uses 2D Gaussians to dynamically model image regions, enabling non-uniform tokenization and achieving state-of-the-art performance in image reconstruction and generation.
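Illustrative note: the core idea of Gaussian-parameterized tokens can be sketched as a naive dense splatting that turns per-token (mean, covariance, feature) triples back into a feature map. This is only a simplified stand-in for the paper's differentiable splatting renderer; all names and the normalization scheme are assumptions.

```python
import torch

def splat_tokens(means, covs, feats, h, w):
    """Render Gaussian-parameterized tokens to a dense feature map.
    means: (N, 2) token centers in [0, 1]; covs: (N, 2, 2) covariances;
    feats: (N, D) texture features. Returns an (h, w, D) feature map."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)        # (HW, 2) pixel coords
    diff = grid.unsqueeze(1) - means.unsqueeze(0)              # (HW, N, 2)
    inv = torch.linalg.inv(covs)                               # (N, 2, 2)
    maha = torch.einsum("pni,nij,pnj->pn", diff, inv, diff)    # squared Mahalanobis dist.
    weight = torch.exp(-0.5 * maha)                            # Gaussian response per token
    weight = weight / (weight.sum(dim=1, keepdim=True) + 1e-8) # normalize over tokens
    fmap = weight @ feats                                      # blend token features
    return fmap.reshape(h, w, -1)
```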

Authors:Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
Title: Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Abstract:
We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.

Authors:Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
Title: VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2× speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
中文: VerlTool作为一个统一模块化框架,通过标准化API、异步执行和灵活插件架构,解决了现有强化学习方法在多轮工具交互中的效率瓶颈,在六大领域实现优异性能的同时大幅降低了开发门槛。
English: VerlTool is a unified modular framework that overcomes the limitations of existing reinforcement learning approaches by enabling efficient multi-turn tool interactions through standardized APIs, asynchronous execution, and a flexible plugin architecture, achieving competitive performance across six domains while accelerating development.
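Illustrative note: the "lightweight Python definition" idea behind the plugin architecture can be pictured as a small tool class that maps a model's tool call to an observation string fed back into the rollout. This is a hypothetical illustration only; VerlTool's actual plugin API, class names, and return types may differ.

```python
# Hypothetical sketch of a tool plugin; not VerlTool's real interface.
from dataclasses import dataclass

@dataclass
class ToolResult:
    observation: str    # text returned to the policy as the next observation
    done: bool = False  # whether this tool call ends the current turn

class CalculatorTool:
    """A minimal tool: receives the model's tool-call argument, returns an observation."""
    name = "calculator"

    def __call__(self, expression: str) -> ToolResult:
        try:
            # toy sandbox: evaluate arithmetic with builtins disabled
            value = eval(expression, {"__builtins__": {}}, {})
            return ToolResult(observation=str(value))
        except Exception as exc:
            return ToolResult(observation=f"error: {exc}")
```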

Authors:Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim
Title: FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
Abstract:
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

Authors:Josef Grün, Lukas Meyer, Maximilian Weiherer, Bernhard Egger, Marc Stamminger, Linus Franke
Title: Towards Integrating Multi-Spectral Imaging with Gaussian Splatting
Abstract:
We present a study of how to integrate color (RGB) and multi-spectral imagery (red, green, red-edge, and near-infrared) into the 3D Gaussian Splatting (3DGS) framework, a state-of-the-art explicit radiance-field-based method for fast and high-fidelity 3D reconstruction from multi-view images. While 3DGS excels on RGB data, naive per-band optimization of additional spectra yields poor reconstructions due to inconsistently appearing geometry in the spectral domain. This problem is prominent even though the actual geometry is the same, regardless of spectral modality. To investigate this, we evaluate three strategies: 1) Separate per-band reconstruction with no shared structure. 2) Splitting optimization, in which we first optimize RGB geometry, copy it, and then fit each new band to the model by optimizing both geometry and band representation. 3) Joint, in which the modalities are jointly optimized, optionally with an initial RGB-only phase. Through quantitative metrics and qualitative novel-view renderings on multi-spectral datasets, we showcase the effectiveness of our dedicated, optimized Joint strategy, which improves overall spectral reconstruction and enhances RGB results through spectral cross-talk. We therefore suggest integrating multi-spectral data directly into the spherical harmonics color components to compactly model each Gaussian's multi-spectral reflectance. Moreover, our analysis reveals several key trade-offs in when and how to introduce spectral bands during optimization, offering practical insights for robust multi-modal 3DGS reconstruction.

Authors:Yutong Gao, Maoyuan Shao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Yu Weng, Xuan Liu, Guoshun Nan
Title: Spotlighter: Revisiting Prompt Tuning from a Representative Mining View
Abstract:
CLIP's success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token's activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token–prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
中文: Spotlighter是一种轻量级令牌选择框架,通过双重评估和语义记忆库保留高分视觉令牌,在提示调优中显著提升准确性与效率,成为该领域的有效基准。
English: Spotlighter is a lightweight token-selection framework that enhances prompt tuning by retaining top-scoring visual tokens through dual-perspective evaluation and a semantic memory bank, achieving superior accuracy and efficiency across benchmarks.

Authors:Zirui Zhou, Zizhao Peng, Dongyang Jin, Chao Fan, Fengwei An, Shiqi Yu
Title: Pose as Clinical Prior: Learning Dual Representations for Scoliosis Screening
Abstract:
Recent AI-based scoliosis screening methods primarily rely on large-scale silhouette datasets, often neglecting clinically relevant postural asymmetries-key indicators in traditional screening. In contrast, pose data provide an intuitive skeletal representation, enhancing clinical interpretability across various medical applications. However, pose-based scoliosis screening remains underexplored due to two main challenges: (1) the scarcity of large-scale, annotated pose datasets; and (2) the discrete and noise-sensitive nature of raw pose coordinates, which hinders the modeling of subtle asymmetries. To address these limitations, we introduce Scoliosis1K-Pose, a 2D human pose annotation set that extends the original Scoliosis1K dataset, comprising 447,900 frames of 2D keypoints from 1,050 adolescents. Building on this dataset, we introduce the Dual Representation Framework (DRF), which integrates a continuous skeleton map to preserve spatial structure with a discrete Postural Asymmetry Vector (PAV) that encodes clinically relevant asymmetry descriptors. A novel PAV-Guided Attention (PGA) module further uses the PAV as clinical prior to direct feature extraction from the skeleton map, focusing on clinically meaningful asymmetries. Extensive experiments demonstrate that DRF achieves state-of-the-art performance. Visualizations further confirm that the model leverages clinical asymmetry cues to guide feature extraction and promote synergy between its dual representations. The dataset and code are publicly available at https://zhouzi180.github.io/Scoliosis1K/.

Authors:Xueyang Kang, Zhengkang Xiang, Zezheng Zhang, Kourosh Khoshelham
Title: Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion
Abstract:
Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views -- even in loop closure scenarios -- while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.
中文摘要:该模型通过先外推360度场景再插值新视角的方法,解决了单图像新视角合成中的长期一致性问题,在用户定义轨迹上比现有方法生成更连贯的视图。
English Summary: The proposed model enhances novel view synthesis from a single image by first extrapolating a 360-degree scene and then interpolating novel views, ensuring long-term consistency and superior performance on user-defined trajectories compared to existing methods.

Authors:Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Lei Zhu
Title: SegDINO: An Efficient Design for Medical and Natural Image Segmentation with DINO-V3
Abstract:
The DINO family of self-supervised vision models has shown remarkable transferability, yet effectively adapting their representations for segmentation remains challenging. Existing approaches often rely on heavy decoders with multi-scale fusion or complex upsampling, which introduce substantial parameter overhead and computational cost. In this work, we propose SegDINO, an efficient segmentation framework that couples a frozen DINOv3 backbone with a lightweight decoder. SegDINO extracts multi-level features from the pretrained encoder, aligns them to a common resolution and channel width, and utilizes a lightweight MLP head to directly predict segmentation masks. This design minimizes trainable parameters while preserving the representational power of foundation features. Extensive experiments across six benchmarks, including three medical datasets (TN3K, Kvasir-SEG, ISIC) and three natural image datasets (MSD, VMD-D, ViSha), demonstrate that SegDINO consistently achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/script-Yang/SegDINO.
中文摘要:SegDINO是一种高效的分割框架,通过将冻结的DINOv3主干网络与轻量级解码器相结合,在多个基准测试中实现最优性能,同时显著降低计算成本。
English Summary: SegDINO is an efficient segmentation framework combining a frozen DINOv3 backbone with a lightweight decoder that achieves state-of-the-art performance across multiple benchmarks while minimizing computational overhead.

Authors:Xinlei Liu, Tao Hu, Peng Yi, Weitao Han, Jichao Xie, Baolin Li
Title: Sequential Difference Maximization: Generating Adversarial Examples via Multi-Stage Optimization
Abstract:
Efficient adversarial attack methods are critical for assessing the robustness of computer vision models. In this paper, we reconstruct the optimization objective for generating adversarial examples as "maximizing the difference between the non-true labels' probability upper bound and the true label's probability," and propose a gradient-based attack method termed Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step." The processes between cycles and between iterative steps are respectively identical, while optimization stages differ in terms of loss functions: in the initial stage, the negative probability of the true label is used as the loss function to compress the solution space; in subsequent stages, we introduce the Directional Probability Difference Ratio (DPDR) loss function to gradually increase the non-true labels' probability upper bound by compressing the irrelevant labels' probabilities. Experiments demonstrate that compared with previous SOTA methods, SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness. Additionally, SDM can be combined with adversarial training methods to enhance their defensive effects. The code is available at https://github.com/X-L-Liu/SDM.
Chinese: 本文提出序列差异最大化(SDM)方法,通过循环-阶段-步骤的三层优化框架,在压缩真实标签概率的同时提升非真实标签概率上限,相比现有最优方法不仅攻击性能更强且成本效益更高。
English: This paper introduces Sequential Difference Maximization (SDM), a gradient-based adversarial attack method that enhances both attack effectiveness and cost-efficiency by optimizing non-true label probabilities while compressing the true label's probability, outperforming previous state-of-the-art methods.
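Illustrative note: the reconstructed objective and the first-stage loss can be written down directly; the later-stage DPDR loss is not specified in the abstract, so the margin expression below is only an illustration of the stated goal (maximizing the gap between the highest non-true-label probability and the true-label probability), not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def stage1_loss(logits, y):
    """Initial SDM stage: use the (negative of the) true label's probability,
    i.e. minimize the true-label probability to compress the solution space."""
    p = F.softmax(logits, dim=1)
    return p.gather(1, y.unsqueeze(1)).mean()

def margin_objective(logits, y):
    """Reconstructed objective: gap between the highest non-true-label probability
    and the true-label probability (to be maximized). DPDR refines this further."""
    p = F.softmax(logits, dim=1)
    p_true = p.gather(1, y.unsqueeze(1)).squeeze(1)
    p_other = p.clone()
    p_other.scatter_(1, y.unsqueeze(1), float("-inf"))  # exclude the true label
    p_top_other = p_other.max(dim=1).values
    return (p_top_other - p_true).mean()
```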

Authors:Zeyu Li, Annan Shu
Title: Aligned Anchor Groups Guided Line Segment Detector
Abstract:
This paper introduces a novel line segment detector, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD), designed to detect line segments from images with high precision and completeness. The algorithm employs a hierarchical approach to extract candidate pixels with different saliency levels, including regular anchors and aligned anchor groups. AAGLSD initiates from these aligned anchor groups, sequentially linking anchors and updating the currently predicted line segment simultaneously. The final predictions are derived through straightforward validation and merging of adjacent line segments, avoiding complex refinement strategies. AAGLSD is evaluated on various datasets and quantitative experiments demonstrate that the proposed method can effectively extract complete line segments from input images compared to other advanced line segment detectors. The implementation is available at https://github.com/LLiDaBao/AAGLSD.
中文摘要:AAGLSD是一种新型线段检测器,通过对齐锚点组和分层像素提取技术,结合简单验证与合并流程,实现了高精度且完整的线段检测效果,性能优于现有先进方法。
English Summary: AAGLSD is a novel line segment detector that uses aligned anchor groups and hierarchical pixel extraction to achieve high-precision detection with simple validation and merging processes, outperforming existing methods in completeness.

Authors:Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev
Title: InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos
Abstract:
Human motion generation has shown great advances thanks to the recent diffusion models trained on large-scale motion capture data. Most existing works, however, currently target the animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle towards generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate InterPose to bring significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

Authors:Xiufeng Huang, Ziyuan Luo, Qi Song, Ruofei Wang, Renjie Wan
Title: MarkSplatter: Generalizable Watermarking for 3D Gaussian Splatting Model via Splatter Image Structure
Abstract:
The growing popularity of 3D Gaussian Splatting (3DGS) has intensified the need for effective copyright protection. Current 3DGS watermarking methods rely on computationally expensive fine-tuning procedures for each predefined message. We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. We introduce GaussianBridge that transforms unstructured 3D Gaussians into Splatter Image format, enabling direct neural processing for arbitrary message embedding. To ensure imperceptibility, we design a Gaussian-Uncertainty-Perceptual heatmap prediction strategy for preserving visual quality. For robust message recovery, we develop a dense segmentation-based extraction mechanism that maintains reliable extraction even when watermarked objects occupy minimal regions in rendered views. Project page: https://kevinhuangxf.github.io/marksplatter.

Authors:Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
Abstract:
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
中文摘要:我们提出的多阶段检索框架融合了密集文档检索、事件感知重排序与多模态匹配技术,通过结合语言推理与视觉检索能力,在EVENTA 2025挑战赛中实现了基于复杂事件描述的图像检索最佳性能。
English Summary: Our multi-stage retrieval framework, integrating dense article retrieval, event-aware reranking, and multimodal matching, achieves top performance in the EVENTA 2025 Challenge by effectively combining language reasoning with visual retrieval for complex event-based image search.
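Illustrative note: the fusion step relies on Reciprocal Rank Fusion (RRF), a standard formula in which each item's fused score is the sum of 1/(k + rank) over the ranked lists produced by the different configurations. A minimal sketch (the constant k=60 is the conventional default, not necessarily the value used by the authors):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of item ids (best first).
    Returns item ids sorted by their fused RRF score, best first."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse two configurations' rankings of candidate images.
fused = reciprocal_rank_fusion([["img3", "img1", "img7"], ["img1", "img3", "img2"]])
```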

Authors:Yumeng Lin, Dong Li, Xintao Wu, Minglai Shao, Xujiang Zhao, Zhong Chen, Chen Zhao
Title: Face4FairShifts: A Large Image Benchmark for Fairness and Robust Learning across Visual Domains
Abstract:
Ensuring fairness and robustness in machine learning models remains a challenge, particularly under domain shifts. We present Face4FairShifts, a large-scale facial image benchmark designed to systematically evaluate fairness-aware learning and domain generalization. The dataset includes 100,000 images across four visually distinct domains with 39 annotations within 14 attributes covering demographic and facial features. Through extensive experiments, we analyze model performance under distribution shifts and identify significant gaps. Our findings emphasize the limitations of existing related datasets and the need for more effective fairness-aware domain adaptation techniques. Face4FairShifts provides a comprehensive testbed for advancing equitable and reliable AI systems. The dataset is available online at https://meviuslab.github.io/Face4FairShifts/.

Authors:Aviral Chharia, Wenbo Gou, Haoye Dong
Title: MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation
Abstract:
While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in terms of generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios. Additionally, they tend to overfit specific camera arrangements and visual scenes from training data, resulting in substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba's traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 on AP25 (+24%) on the challenging three-camera setting in CMU Panoptic, +7.0 on AP25 (+13%) on varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations. Project Website: https://aviralchharia.github.io/MV-SSM

Authors:Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini
Title: Towards Methane Detection Onboard Satellites
Abstract:
Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.
中文摘要:一种利用未正射校正卫星数据的新型机器学习方法,在减少预处理需求的同时,实现了与传统正射校正方法相当的甲烷检测性能。
English summary: A new machine learning approach using unorthorectified satellite data achieves methane detection performance comparable to traditional orthorectified methods while reducing preprocessing requirements.

Authors:Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein
Title: Promptable Longitudinal Lesion Segmentation in Whole-Body CT
Abstract:
Accurate segmentation of lesions in longitudinal whole-body CT is essential for monitoring disease progression and treatment response. While automated methods benefit from incorporating longitudinal information, they remain limited in their ability to consistently track individual lesions across time. Task 2 of the autoPET/CT IV Challenge addresses this by providing lesion localizations and baseline delineations, framing the problem as longitudinal promptable segmentation. In this work, we extend the recently proposed LongiSeg framework with promptable capabilities, enabling lesion-specific tracking through point and mask interactions. To address the limited size of the provided training set, we leverage large-scale pretraining on a synthetic longitudinal CT dataset. Our experiments show that pretraining substantially improves the ability to exploit longitudinal context, yielding an improvement of up to 6 Dice points compared to models trained from scratch. These findings demonstrate the effectiveness of combining longitudinal context with interactive prompting for robust lesion tracking. Code is publicly available at https://github.com/MIC-DKFZ/LongiSeg/tree/autoPET.
中文摘要:本研究通过引入可提示分割功能扩展了LongiSeg框架,利用合成数据的大规模预训练,在纵向CT扫描中实现病灶精准追踪,将模型性能提升最高达6个Dice值点。
English Summary: The study enhances the LongiSeg framework with promptable segmentation for tracking individual lesions in longitudinal CT scans, using large-scale pretraining on synthetic data to improve performance by up to 6 Dice points.

Authors:Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao
Title: Galaxea Open-World Dataset and G0 Dual-System VLA Model
Abstract:
We present Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation, demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.

Authors:Peirong Liu, Oula Puonti, Xiaoling Hu, Karthik Gopinath, Annabel Sorby-Adams, Daniel C. Alexander, W. Taylor Kimberly, Juan E. Iglesias
Title: A Modality-agnostic Multi-task Foundation Model for Human Brain Imaging
Abstract:
Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities -- notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. Here we introduce BrainFM, a modality-agnostic, multi-task vision foundation model for human brain imaging. With the proposed "mild-to-severe" intra-subject generation and "real-synth" mix-up training strategy, BrainFM is resilient to the appearance of acquired images (e.g., modality, contrast, deformation, resolution, artifacts), and can be directly applied to five fundamental brain imaging tasks, including image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance, bias field estimation, and registration. We evaluate the efficacy of BrainFM on eleven public datasets, and demonstrate its robustness and effectiveness across all tasks and input modalities. Code is available at https://github.com/jhuldr/BrainFM.
中文:BrainFM是一种适用于人脑成像的多任务视觉基础模型,通过创新的训练策略解决了未校准模态的泛化难题,能在多种临床数据上稳定执行五项核心成像任务。
English: BrainFM is a versatile vision foundation model for brain imaging that overcomes generalization challenges in uncalibrated modalities like MRI through innovative training strategies, enabling robust performance across five key imaging tasks on diverse clinical data.

Authors:Yasser Benigmim, Subhankar Roy, Khalid Oublal, Imad Eddine Marouf, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière
Title: Make me an Expert: Distilling from Generalist Black-Box Models into Specialized Models for Semantic Segmentation
Abstract:
The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre-trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black-box models that do not expose their weights, training data, or logits, a constraint under which current domain adaptation paradigms are impractical? To address this challenge, we introduce the Black-Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open-vocabulary and trained on large-scale general-purpose data, and (2) access is limited to one-hot predictions only. We identify that open-vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the "curse of resolution". Our method, ATtention-Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black-box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black-box supervision across multiple datasets while requiring only one-hot API predictions. Our code is available at https://github.com/yasserben/ATGC.
中文: AIaaS通过API提供预训练模型,但在无法获取模型内部参数的情况下增加了本地模型训练的难度,因此提出了黑盒蒸馏(B2D)框架和ATGC方法,利用注意力引导的动态尺度选择优化伪标签生成,实现高效知识蒸馏。
English: AIaaS enables access to pre-trained models via APIs but complicates local model training without access to model internals, leading to the proposed Black-Box Distillation (B2D) setting and ATGC method that uses attention-guided scaling to optimize pseudo-labeling for effective distillation.
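Illustrative note: the entropy-based scale scoring can be sketched as computing the entropy of an attention map at each candidate input resolution and keeping the most informative scale. Here the scale with the lowest (most concentrated) attention entropy is chosen; ATGC's actual scoring rule may differ, and all names are assumptions.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Entropy of a non-negative attention map (higher = more diffuse attention)."""
    p = attn.flatten().astype(np.float64)
    p = p / (p.sum() + 1e-8)
    return float(-(p * np.log(p + 1e-8)).sum())

def select_scale(attn_maps_per_scale: dict) -> int:
    """attn_maps_per_scale: {scale: attention map at that input resolution}.
    Picks the scale whose attention is most concentrated (lowest entropy)."""
    return min(attn_maps_per_scale,
               key=lambda s: attention_entropy(attn_maps_per_scale[s]))
```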

Authors:Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yue, Min Liu, Yaonan Wang, Hang Zhang
Title: Encoder-Only Image Registration
Abstract:
Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR's effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is publicly available on https://github.com/XiangChen1994/EOIR.
Chinese: 提出的仅编码器图像配准(EOIR)框架通过轻量级卷积网络将特征学习与流估计分离,在可变形图像配准中实现了精度与效率、精度与平滑度的更优平衡。
English: The proposed Encoder-Only Image Registration (EOIR) framework separates feature learning from flow estimation using a lightweight convolutional network to achieve superior accuracy-efficiency and accuracy-smoothness trade-offs in deformable image registration.

Authors:Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing
Title: Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
Abstract:
Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.
中文: Face-MoGLE提出了一种新颖的扩散变换器框架,通过语义解耦的潜在建模和全局-局部专家混合机制,实现了对人脸生成的精细控制,在高质量输出和强泛化能力方面表现卓越。
English: Face-MoGLE introduces a novel diffusion transformer framework that enables precise, fine-grained control over face generation through semantic-decoupled latent modeling and a mixture of global-local experts, achieving high-quality results with robust generalization.

Authors:Shumpei Takezaki, Ryoma Bise, Shinnosuke Matsuo
Title: NoiseCutMix: A Novel Data Augmentation Approach by Mixing Estimated Noise in Diffusion Models
Abstract:
In this study, we propose a novel data augmentation method that introduces the concept of CutMix into the generation process of diffusion models, thereby exploiting both the ability of diffusion models to generate natural and high-resolution images and the characteristic of CutMix, which combines features from two classes to create diverse augmented data. Representative data augmentation methods for combining images from multiple classes include CutMix and MixUp. However, techniques like CutMix often result in unnatural boundaries between the two images due to contextual differences. Therefore, in this study, we propose a method, called NoiseCutMix, to achieve natural, high-resolution image generation featuring the fused characteristics of two classes by partially combining the estimated noise corresponding to two different classes in a diffusion model. In the classification experiments, we verified the effectiveness of the proposed method by comparing it with conventional data augmentation techniques that combine multiple classes, random image generation using Stable Diffusion, and combinations of these methods. Our codes are available at: https://github.com/shumpei-takezaki/NoiseCutMix
中文: 本研究提出NoiseCutMix新方法,将CutMix概念融入扩散模型,通过融合两类噪声生成自然高清的双类特征图像,在分类实验中验证了其优于传统数据增强技术的有效性。
English: This study introduces NoiseCutMix, a novel data augmentation technique that integrates CutMix into diffusion models to generate natural, high-resolution images by blending noise from two classes, enhancing classification performance over traditional methods.
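Illustrative note: the key operation is combining the noise estimated under two different class conditions region-wise, CutMix-style, inside the diffusion sampling step. The sketch below assumes `eps_a` and `eps_b` are the model's noise predictions conditioned on class A and class B; the rectangular region selection is the standard CutMix recipe, and the actual NoiseCutMix procedure may differ in details.

```python
import torch

def cutmix_mask(h, w, lam, device="cpu"):
    """Rectangular CutMix mask covering roughly a (1 - lam) fraction of the area."""
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = torch.ones(1, 1, h, w, device=device)
    mask[..., y0:y1, x0:x1] = 0.0   # the cut region takes class B's noise
    return mask

def mixed_noise(eps_a, eps_b, lam):
    """Combine noise estimates conditioned on class A and class B region-wise,
    so the denoised image carries features of both classes."""
    mask = cutmix_mask(eps_a.shape[-2], eps_a.shape[-1], lam, eps_a.device)
    return mask * eps_a + (1 - mask) * eps_b
```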

Authors:Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan, Nassir Navab, Hongbin Liu, Zhen Lei, Jiebo Luo
Title: SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
Abstract:
Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist: inadequate visual content perception and insufficient temporal awareness in surgical videos, both of which hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.
中文: SurgLLM框架提出了一种大型多模态模型,通过创新的预训练和调优策略增强手术视频的空间聚焦和时间感知能力,在多种理解任务中实现了卓越性能。
English: The SurgLLM framework introduces a large multimodal model that enhances spatial focus and temporal awareness in surgical video understanding, achieving superior performance across various tasks through innovative pretraining and tuning strategies.

Authors:Xunpeng Yi, Yibing Zhang, Xinyu Xiang, Qinglong Yan, Han Xu, Jiayi Ma
Title: LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables
Abstract:
Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting the applicability to real-time fusion devices. In this paper, we propose a novel approach towards extremely fast fusion via distillation to learnable lookup tables specifically designed for image fusion, termed LUT-Fuse. Firstly, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well-suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time compared to the current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even in low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at https://github.com/zyb5/LUT-Fuse.
中文: 本文提出LUT-Fuse方法,通过可学习查找表与蒸馏策略实现极速红外与可见光图像融合,在保持高性能的同时,其速度比现有轻量级算法快十倍以上,适用于各类移动设备。
English: This paper introduces LUT-Fuse, a novel method that uses learnable lookup tables via distillation to achieve extremely fast infrared and visible image fusion, significantly outperforming current lightweight algorithms in speed while maintaining high performance across various devices.

Authors:Wei Ao, Vishnu Naresh Boddeti
Title: CryptoFace: End-to-End Encrypted Face Recognition
Abstract:
Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly arising from unauthorized access to sensitive biometric data. This paper introduces CryptoFace, the first end-to-end encrypted face recognition system with fully homomorphic encryption (FHE). It enables secure processing of facial data across all stages of a face-recognition process--feature extraction, storage, and matching--without exposing raw images or features. We introduce a mixture of shallow patch convolutional networks to support higher-dimensional tensors via patch-based processing while reducing the multiplicative depth and, thus, inference latency. Parallel FHE evaluation of these networks ensures near-resolution-independent latency. On standard face recognition benchmarks, CryptoFace significantly accelerates inference and increases verification accuracy compared to the state-of-the-art FHE neural networks adapted for face recognition. CryptoFace will facilitate secure face recognition systems requiring robust and provable security. The code is available at https://github.com/human-analysis/CryptoFace.
中文:CryptoFace是首个采用全同态加密的端到端加密人脸识别系统,可在确保面部数据安全处理的同时,相比现有方法显著加速推理并提高验证准确率。
English: CryptoFace is the first end-to-end encrypted face recognition system using fully homomorphic encryption, enabling secure processing of facial data while accelerating inference and improving verification accuracy compared to existing methods.

Authors:Hikmat Khan, Syed Farhan Alam Zaidi, Pir Masoom Shah, Kiruthika Balakrishnan, Rabia Khan, Muhammad Waqas, Jia Wu
Title: MorphGen: Morphology-Guided Representation Learning for Robust Single-Domain Generalization in Histopathological Cancer Classification
Abstract:
Domain generalization in computational histopathology is hindered by heterogeneity in whole slide images (WSIs), caused by variations in tissue preparation, staining, and imaging conditions across institutions. Unlike machine learning systems, pathologists rely on domain-invariant morphological cues such as nuclear atypia (enlargement, irregular contours, hyperchromasia, chromatin texture, spatial disorganization), structural atypia (abnormal architecture and gland formation), and overall morphological atypia that remain diagnostic across diverse settings. Motivated by this, we hypothesize that explicitly modeling biologically robust nuclear morphology and spatial organization will enable the learning of cancer representations that are resilient to domain shifts. We propose MorphGen (Morphology-Guided Generalization), a method that integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework. By aligning latent representations of images and nuclear masks, MorphGen prioritizes diagnostic features such as nuclear and morphological atypia and spatial organization over staining artifacts and domain-specific features. To further enhance out-of-distribution robustness, we incorporate stochastic weight averaging (SWA), steering optimization toward flatter minima. Attention map analyses revealed that MorphGen primarily relies on nuclear morphology, cellular composition, and spatial cell organization within tumors or normal regions for final classification. Finally, we demonstrate resilience of the learned representations to image corruptions (such as staining artifacts) and adversarial attacks, showcasing not only OOD generalization but also addressing critical vulnerabilities in current deep learning systems for digital pathology. Code, datasets, and trained models are available at: https://github.com/hikmatkhan/MorphGen
中文: MorphGen通过将核形态和空间组织整合到监督对比学习框架中,增强了组织病理学中的域泛化能力,提高了对域偏移和对抗攻击的鲁棒性。
English: MorphGen enhances domain generalization in histopathology by integrating nuclear morphology and spatial organization into a supervised contrastive learning framework, improving resilience to domain shifts and adversarial attacks.

Authors:Ghassen Baklouti, Maxime Zanella, Ismail Ben Ayed
Title: Language-Aware Information Maximization for Transductive Few-Shot CLIP
Abstract:
Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of transduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network's probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performances, which points to the potential of adapting a subset of the model's parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Our code is publicly available at: https://github.com/ghassenbaklouti/LIMO
中文: 本文提出LIMO方法,通过信息论框架和参数高效微调技术,在视觉语言模型的转导式小样本学习中实现了突破性性能提升。
English: This paper introduces LIMO, a novel transductive few-shot learning method for vision-language models that combines information-theoretic principles with parameter-efficient fine-tuning to significantly outperform existing approaches.
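Illustrative note: the three-term structure of the loss can be sketched directly. The mutual-information term below uses the common marginal-minus-conditional entropy surrogate, and the coefficients are placeholders; the paper's exact estimator and weighting may differ.

```python
import torch
import torch.nn.functional as F

def limo_style_loss(logits_q, logits_s, labels_s, zero_shot_q, alpha=1.0, beta=1.0):
    """Sketch of a LIMO-style objective (illustrative, not the paper's exact loss).
    logits_q: (Nq, C) logits on unlabeled query samples;
    logits_s, labels_s: logits and labels of the labeled shots;
    zero_shot_q: (Nq, C) frozen text-driven zero-shot probabilities."""
    p_q = F.softmax(logits_q, dim=1)
    # (i) mutual-information surrogate: marginal entropy minus conditional entropy
    p_marg = p_q.mean(dim=0)
    h_marg = -(p_marg * torch.log(p_marg + 1e-8)).sum()
    h_cond = -(p_q * torch.log(p_q + 1e-8)).sum(dim=1).mean()
    mutual_info = h_marg - h_cond
    # (ii) KL divergence to the zero-shot predictions keeps outputs near the text prior
    kl = F.kl_div(torch.log(p_q + 1e-8), zero_shot_q, reduction="batchmean")
    # (iii) cross-entropy on the labeled shots
    ce = F.cross_entropy(logits_s, labels_s)
    return ce + alpha * kl - beta * mutual_info  # maximize MI, minimize the rest
```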

Authors:Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, Mohamed Elhoseiny, Giorgos Tolias
Title: Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders
Abstract:
This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
中文: 本研究提出了一种新颖的两步文本-图像检索方法,先通过扩散模型将文本查询转换为视觉表示,再融合跨模态相似度分数,显著超越了仅依赖文本的检索方法。
English: This study introduces a novel two-step method for text-to-image retrieval that first converts text queries into visual representations using a diffusion model and then fuses similarity scores across modalities, significantly outperforming text-only approaches.
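Illustrative note: the two-step pipeline can be sketched end to end, with simple mean pooling standing in for the paper's learned aggregation network and a fixed weight standing in for its score fusion. `generate_image`, `vision_encode`, and `text_encode` are hypothetical callables for a diffusion generator and the vision/text encoders.

```python
import torch
import torch.nn.functional as F

def retrieve(text_query, gallery_feats, generate_image, vision_encode, text_encode,
             k=4, w=0.5):
    """gallery_feats: (N, D) L2-normalized image embeddings of the gallery.
    Returns gallery indices sorted from most to least similar."""
    # Step 1: turn the text query into visual queries via a diffusion model.
    gen_imgs = [generate_image(text_query) for _ in range(k)]
    gen_feats = torch.stack([vision_encode(im) for im in gen_imgs])   # (k, D)
    visual_query = F.normalize(gen_feats.mean(dim=0), dim=-1)         # naive aggregation
    # Step 2: image-to-image similarity, fused with the text-query similarity.
    sim_visual = gallery_feats @ visual_query
    sim_text = gallery_feats @ F.normalize(text_encode(text_query), dim=-1)
    scores = w * sim_visual + (1 - w) * sim_text
    return scores.argsort(descending=True)
```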

Authors:Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid
Title: VoCap: Video Object Captioning and Segmentation from Any Prompt
Abstract:
Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such our model addresses simultaneously the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feed this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
中文: 本文提出VoCap模型,通过利用SAV-Caption数据集的伪标注数据,实现了可提示的视频对象分割与描述功能,在多项视频理解任务中取得了领先性能。
English: This paper introduces VoCap, a versatile video model that performs promptable object segmentation and captioning by leveraging pseudo-annotated data from SAV-Caption, achieving state-of-the-art results in multiple video understanding tasks.

Authors:Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Title: TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank
Abstract:
Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
中文:TMUAD框架通过结合文本与图像特征的三重记忆系统,统一了结构和逻辑异常检测,在多个数据集上取得了领先性能。
English: The TMUAD framework introduces a three-memory system combining text and image features to unify structural and logical anomaly detection, achieving state-of-the-art results across multiple datasets.
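Illustrative note: the retrieval-and-fusion step common to memory-bank anomaly detectors can be sketched as a nearest-neighbor distance per bank, fused across the text-, object-, and patch-level banks. The distance metric and equal fusion weights below are placeholders, not the paper's exact scoring.

```python
import torch

def memory_bank_score(query_feat, bank_feats):
    """Anomaly score as the distance to the nearest normal feature in one bank.
    query_feat: (D,); bank_feats: (M, D)."""
    d = torch.cdist(query_feat.unsqueeze(0), bank_feats)   # (1, M) pairwise distances
    return d.min().item()

def fused_anomaly_score(query_feats, banks, weights=(1 / 3, 1 / 3, 1 / 3)):
    """query_feats / banks: per-level query features and memory banks
    (e.g., text-, object-, and patch-level). Returns the fused anomaly score."""
    scores = [memory_bank_score(q, b) for q, b in zip(query_feats, banks)]
    return sum(w * s for w, s in zip(weights, scores))
```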

Authors:Qiyue Sun, Qiming Huang, Yang Yang, Hongjun Wang, Jianbo Jiao
Title: What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos
Abstract:
Humans usually show exceptional generalisation and discovery ability in the open world, when being shown uncommon new concepts. Whereas most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g., sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction. The project page is at: https://julysun98.github.io/atypical_dataset.

Authors:Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro
Title: Learning from Silence and Noise for Visual Sound Source Localization
Abstract:
Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e., in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/.

Authors:Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
Title: Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Abstract:
Conventional optical character recognition (OCR) techniques segmented each character and then recognized it, making them prone to character-segmentation errors and unable to exploit language-model context. Advances in sequence-to-sequence translation over the last decade led to modern techniques that first detect words and then feed one word at a time to a model that directly outputs the full word as a sequence of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that this shift has moved the accuracy bottleneck to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection and provides larger sentence context for better utilization of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite a thorough literature survey, we did not find any public dataset to train and benchmark such a shift from word- to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experiments reveal a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4x improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website

Authors:Fatih Erdoğan, Merve Rabia Barın, Fatma Güney
Title: Mapping like a Skeptic: Probabilistic BEV Projection for Online HD Mapping
Abstract:
Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird's Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at https://github.com/Fatih-Erdogan/mapping-like-skeptic .
English Summary: This paper introduces a novel probabilistic projection method that refines geometric mapping using camera parameters and confidence scores to enhance HD map accuracy by filtering irrelevant elements and improving temporal information integration, demonstrating superior performance on benchmark datasets.
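The geometric starting point of this projection, mapping BEV ground-plane cells into the image via known camera parameters before any learned refinement, can be sketched as follows; the per-cell confidence head that the paper learns on top of this mapping is only indicated in a comment.

```python
import numpy as np

def project_bev_to_image(bev_xy: np.ndarray, K: np.ndarray, T_cam_from_ground: np.ndarray):
    """Project BEV ground-plane cells (z = 0, in metres) into pixel coordinates.

    bev_xy: (N, 2) cell centres; K: (3, 3) intrinsics; T_cam_from_ground: (4, 4) extrinsics.
    Returns pixel coordinates (N, 2) and a validity mask for cells in front of the camera.
    """
    n = len(bev_xy)
    pts = np.concatenate([bev_xy, np.zeros((n, 1)), np.ones((n, 1))], axis=1)   # homogeneous ground points
    cam = (T_cam_from_ground @ pts.T).T[:, :3]        # ground frame -> camera frame
    valid = cam[:, 2] > 1e-3                          # keep points in front of the camera
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-3, None)
    return uv, valid

# A per-cell confidence in [0, 1] (e.g. from a small learned head, assumed here) can then
# down-weight the image features sampled at `uv` before they are accumulated into the BEV map.
```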

Authors:Maximilian Rokuss, Yannick Kirchhoff, Fabian Isensee, Klaus H. Maier-Hein
Title: Towards Interactive Lesion Segmentation in Whole-Body PET/CT with Promptable Models
Abstract:
Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at https://github.com/MIC-DKFZ/autoPET-interactive
English Summary: This study enhances the nnU-Net pipeline with interactive capabilities by encoding user clicks via Euclidean Distance Transform, demonstrating superior segmentation accuracy and robustness in multi-center PET/CT imaging through simulated user prompts and ensemble modeling.
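The click encoding compared in the paper (EDT versus Gaussian kernels) can be sketched with scipy; the optional clipping and the use of separate foreground/background channels are assumptions about implementation details.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def clicks_to_edt_channel(shape, clicks, spacing=(1.0, 1.0, 1.0), clip_mm=None) -> np.ndarray:
    """Encode user clicks as a Euclidean Distance Transform channel.

    shape: volume shape (D, H, W); clicks: list of voxel coordinates; spacing: voxel size in mm.
    Returns the distance to the nearest click for every voxel, optionally clipped.
    """
    seeds = np.ones(shape, dtype=bool)          # EDT measures distance to zero-valued voxels,
    for c in clicks:                            # so mark clicked voxels with 0
        seeds[tuple(c)] = False
    edt = distance_transform_edt(seeds, sampling=spacing)
    if clip_mm is not None:
        edt = np.clip(edt, 0, clip_mm)
    return edt.astype(np.float32)

# Foreground and background clicks would each get their own channel, concatenated to the
# PET/CT input of the nnU-Net-style network as additional input channels.
```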

Authors:Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Title: Is this chart lying to me? Automating the detection of misleading visualizations
Abstract:
Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
English: Misleading visualizations distort data and mislead both humans and AI models, prompting the creation of the Misviz benchmark and Misviz-synth dataset to advance detection methods, though current models still struggle with the task.

Authors:Nicolas Soncini, Javier Cremona, Erica Vidal, Maximiliano García, Gastón Castro, Taihú Pire
Title: The Rosario Dataset v2: Multimodal Dataset for Agricultural Robotics
Abstract:
We present a multi-modal dataset collected in a soybean crop field, comprising over two hours of recorded data from sensors such as stereo infrared camera, color camera, accelerometer, gyroscope, magnetometer, GNSS (Single Point Positioning, Real-Time Kinematic and Post-Processed Kinematic), and wheel odometry. This dataset captures key challenges inherent to robotics in agricultural environments, including variations in natural lighting, motion blur, rough terrain, and long, perceptually aliased sequences. By addressing these complexities, the dataset aims to support the development and benchmarking of advanced algorithms for localization, mapping, perception, and navigation in agricultural robotics. The platform and data collection system are designed to meet the key requirements for evaluating multi-modal SLAM systems, including hardware synchronization of sensors, 6-DOF ground truth and loops on long trajectories. We run multimodal state-of-the-art SLAM methods on the dataset, showcasing their existing limitations in agricultural settings. The dataset and utilities to work with it are released at https://cifasis.github.io/rosariov2/.

Authors:Francisco Caetano, Christiaan Viviers, Peter H. H. de With, Fons van der Sommen
Title: MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Abstract:
Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html

Authors:Jakub Straka, Ivan Gruber
Title: SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing
Abstract:
Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO's individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
English: This paper introduces SatDINO, a self-supervised model for satellite imagery that outperforms masked autoencoder methods and achieves competitive benchmark results through novel enhancements like GSD encoding and adaptive view sampling.

Authors:Zhizhong Huang, Xiaoming Liu
Title: Generalizable Object Re-Identification via Visual In-Context Prompting
Abstract:
Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
English: The paper introduces Visual In-Context Prompting (VICP), a framework that leverages large language models and vision foundation models to generalize object re-identification to unseen categories using in-context examples, eliminating the need for dataset-specific retraining and outperforming existing baselines.

Authors:Zhenghao He, Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Title: GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
Abstract:
Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
English: The proposed Global Concept Activation Vector (GCAV) framework unifies concept representations across neural network layers using contrastive learning and attention fusion, significantly improving concept consistency, localization, and model interpretability.
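TGCAV applies the standard TCAV test to the fused global CAV. The TCAV score itself, the fraction of examples whose class-logit gradient has a positive component along the concept vector, can be sketched as below; how GCAV builds the global vector (contrastive alignment plus attention fusion) is not reproduced here.

```python
import numpy as np

def tcav_score(grads: np.ndarray, cav: np.ndarray) -> float:
    """TCAV-style score for one class/concept pair.

    grads: (N, D) gradients of the target logit w.r.t. the layer activations where the CAV lives,
           one row per input example.
    cav:   (D,) concept activation vector (here, the fused global CAV).
    Returns the fraction of examples whose directional derivative along the CAV is positive.
    """
    directional = grads @ (cav / (np.linalg.norm(cav) + 1e-12))
    return float((directional > 0).mean())
```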

Authors:Kevin Mayer, Alex Vesel, Xinyi Zhao, Martin Fischer
Title: SYNBUILD-3D: A large, multi-modal, and semantically rich synthetic dataset of 3D building models at Level of Detail 4
Abstract:
3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
English Summary: The SYNBUILD-3D dataset addresses the shortage of annotated 3D building data by providing over 6.2 million synthetic residential buildings with three modalities, enabling the development of generative AI for automated, semantically consistent 3D model creation at LoD 4.

Authors:Ao Shen, Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
Title: RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration
Abstract:
Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and a domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-rays vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
English Summary: RadGS-Reg is a novel framework that enhances CT/X-ray registration by jointly performing 3D Radiative Gaussians reconstruction with Counterfactual Attention Learning and 3D/3D registration, achieving state-of-the-art performance on vertebral imaging through patient-specific adaptation from simulated to real data.

Authors:Xurui Peng, Hong Liu, Chenqian Yan, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin
Title: ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion
Abstract:
Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
English: ERTACache is a novel caching framework that accelerates diffusion models by mitigating cumulative errors from feature caching, achieving up to 2x faster inference while maintaining or enhancing output quality across image and video generation tasks.
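The general mechanism, reusing a cached block output on designated timesteps and applying a correction to the reused residual, can be sketched as a wrapper; the reuse schedule and the scalar correction below stand in for ERTACache's offline-profiled steps and trajectory-aware coefficient and are purely illustrative.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Wrap a diffusion transformer block so its output can be reused on selected timesteps.

    `reuse_steps` and `correction` are placeholders for the offline-profiled reuse schedule and
    the trajectory-aware correction coefficient described in the abstract; both are assumptions.
    """
    def __init__(self, block: torch.nn.Module, reuse_steps: set, correction: float = 1.0):
        super().__init__()
        self.block = block
        self.reuse_steps = reuse_steps
        self.correction = correction
        self._cache = None

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        if step in self.reuse_steps and self._cache is not None:
            return x + self.correction * self._cache      # reuse the rectified cached residual
        out = self.block(x)
        self._cache = (out - x).detach()                  # cache the residual, not the raw output
        return out
```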

Authors:Jun-Kun Chen, Aayush Bansal, Minh Phuoc Vo, Yu-Xiong Wang
Title: Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Abstract:
We present Dress&Dance, a video diffusion framework that generates high quality 5-second-long 24 FPS virtual try-on videos at 1152x720 resolution of a user wearing desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous tops and bottoms try-on in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained on heterogeneous training data, combining limited video data and a larger, more readily available image dataset, in a multistage progressive manner. Dress&Dance outperforms existing open source and commercial solutions and enables a high quality and flexible try-on experience.

Authors:Frano Rajič, Haofei Xu, Marko Mihajlovic, Siyuan Li, Irem Demir, Emircan Gündoğdu, Lei Ke, Sergey Prokudin, Marc Pollefeys, Siyu Tang
Title: Multi-View 3D Point Tracking
Abstract:
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.

Authors:Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
Title: Mixture of Contexts for Long Video Generation
Abstract:
Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
English Summary: Long video generation is addressed by reframing it as an information retrieval task and introducing Mixture of Contexts (MoC), a sparse attention routing mechanism that enables efficient, near-linear-scaling long-term memory retrieval while maintaining content consistency over minutes of content.
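The routing idea, letting each query attend only to a few relevant history chunks plus mandatory anchors, can be sketched as follows; scoring chunks by their mean key and using a fixed top-k are simplifying assumptions, not the paper's learned router.

```python
import torch

def route_chunks(q: torch.Tensor, k_chunks: torch.Tensor, top_k: int = 4, anchor_ids=(0,)) -> torch.Tensor:
    """Select, for each query, which history chunks it may attend to.

    q:        (Q, D) query vectors.
    k_chunks: (C, L, D) keys grouped into C chunks of length L.
    Returns a (Q, C) boolean mask; True marks a selected chunk.
    """
    chunk_desc = k_chunks.mean(dim=1)                         # (C, D) one descriptor per chunk
    scores = q @ chunk_desc.T                                 # (Q, C) query-to-chunk relevance
    top_idx = scores.topk(min(top_k, scores.shape[1]), dim=1).indices
    sel = torch.zeros_like(scores, dtype=torch.bool)
    sel[torch.arange(q.shape[0]).unsqueeze(1), top_idx] = True   # keep top-k chunks per query
    sel[:, list(anchor_ids)] = True                           # mandatory anchors (e.g. caption chunk)
    return sel
```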

Authors:Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei
Title: Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning
Abstract:
Deepfake detection remains a formidable challenge due to the complex and evolving nature of fake content in real-world scenarios. However, existing academic benchmarks suffer from severe discrepancies from industrial practice, typically featuring homogeneous training sources and low-quality testing images, which hinder the practical deployments of current detectors. To mitigate this gap, we introduce HydraFake, a dataset that simulates real-world challenges with hierarchical generalization testing. Specifically, HydraFake involves diversified deepfake techniques and in-the-wild forgeries, along with rigorous training and evaluation protocol, covering unseen model architectures, emerging forgery techniques and novel data domains. Building on this resource, we propose Veritas, a multi-modal large language model (MLLM) based deepfake detector. Different from vanilla chain-of-thought (CoT), we introduce pattern-aware reasoning that involves critical reasoning patterns such as "planning" and "self-reflection" to emulate human forensic process. We further propose a two-stage training pipeline to seamlessly internalize such deepfake reasoning capacities into current MLLMs. Experiments on HydraFake dataset reveal that although previous detectors show great generalization on cross-model scenarios, they fall short on unseen forgeries and data domains. Our Veritas achieves significant gains across different OOD scenarios, and is capable of delivering transparent and faithful detection outputs.
English Summary: The HydraFake dataset addresses real-world deepfake detection challenges through hierarchical generalization testing, while the Veritas detector leverages pattern-aware reasoning within a multi-modal framework to achieve superior cross-domain performance with transparent results.

Authors:Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
Title: CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Abstract:
Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
English: CogVLA is a cognition-aligned vision-language-action framework that enhances efficiency and performance through instruction-driven routing and sparsification, achieving state-of-the-art results while significantly reducing training and inference costs.
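Both routing stages build on FiLM-style conditioning, in which an instruction embedding produces per-channel scale and shift for visual tokens. A minimal FiLM layer is sketched below; how CogVLA derives the instruction embedding and performs aggregation or pruning on top of it is not reproduced.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift visual tokens with an instruction vector."""
    def __init__(self, instr_dim: int, token_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(instr_dim, 2 * token_dim)

    def forward(self, tokens: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) visual tokens; instr: (B, instr_dim) pooled instruction embedding
        gamma, beta = self.to_gamma_beta(instr).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * tokens + beta.unsqueeze(1)   # broadcast over the token axis
```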

Authors:Huynh Tong Dang Khoa, Dang Hoai Nam, Vo Nguyen Le Duy
Title: FW-GAN: Frequency-Driven Handwriting Synthesis with Wave-Modulated MLP Generator
Abstract:
Labeled handwriting data is often scarce, limiting the effectiveness of recognition systems that require diverse, style-consistent training samples. Handwriting synthesis offers a promising solution by generating artificial data to augment training. However, current methods face two major limitations. First, most are built on conventional convolutional architectures, which struggle to model long-range dependencies and complex stroke patterns. Second, they largely ignore the crucial role of frequency information, which is essential for capturing fine-grained stylistic and structural details in handwriting. To address these challenges, we propose FW-GAN, a one-shot handwriting synthesis framework that generates realistic, writer-consistent text from a single example. Our generator integrates a phase-aware Wave-MLP to better capture spatial relationships while preserving subtle stylistic cues. We further introduce a frequency-guided discriminator that leverages high-frequency components to enhance the authenticity detection of generated samples. Additionally, we introduce a novel Frequency Distribution Loss that aligns the frequency characteristics of synthetic and real handwriting, thereby enhancing visual fidelity. Experiments on Vietnamese and English handwriting datasets demonstrate that FW-GAN generates high-quality, style-consistent handwriting, making it a valuable tool for augmenting data in low-resource handwriting recognition (HTR) pipelines. Official implementation is available at https://github.com/DAIR-Group/FW-GAN
English: The proposed FW-GAN framework overcomes limitations in handwriting synthesis by integrating phase-aware Wave-MLP and frequency-guided components to generate realistic, style-consistent handwriting from a single sample, effectively augmenting training data for recognition systems.
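A frequency-alignment term of the kind described can be sketched with torch.fft; matching batch-averaged log-magnitude spectra with an L1 penalty is an assumption, since the abstract does not specify the exact statistic used by the Frequency Distribution Loss.

```python
import torch

def frequency_distribution_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """Align the magnitude spectra of generated and real handwriting images.

    fake, real: (B, C, H, W) image batches. The batch-averaged log-magnitude spectrum and the
    L1 comparison are simplifying assumptions about the paper's loss.
    """
    def log_mag(x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.fft2(x, norm="ortho")    # 2D FFT over the spatial dimensions
        return torch.log1p(spec.abs())
    return torch.mean(torch.abs(log_mag(fake).mean(dim=0) - log_mag(real).mean(dim=0)))
```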

Authors:Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha
Title: Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
Abstract:
Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
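The training-free recipe, cluster prompts by semantic similarity, run the shared early denoising steps once per cluster, then branch per prompt, can be sketched as follows; the `embed` and `denoise` callables and the step split are placeholders, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_prompts(prompt_embs: np.ndarray, n_clusters: int) -> np.ndarray:
    """Group semantically similar prompts so early diffusion steps can be shared per cluster."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(prompt_embs).labels_

def generate_set(prompts, init_latent, embed, denoise,
                 shared_steps=20, total_steps=50, n_clusters=8):
    """Share the coarse early denoising within each prompt cluster, then branch per prompt.

    `embed(text) -> np.ndarray` and `denoise(latent, prompt, t_start, t_end)` are placeholders
    for a text encoder and a diffusion sampler; they are not real APIs from the paper.
    """
    labels = cluster_prompts(np.stack([embed(p) for p in prompts]), n_clusters)
    outputs = [None] * len(prompts)
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        shared = denoise(init_latent, prompts[members[0]], 0, shared_steps)       # coarse structure, once
        for i in members:
            outputs[i] = denoise(shared, prompts[i], shared_steps, total_steps)   # prompt-specific detail
    return outputs
```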

Authors:Paritosh Parmar, Eric Peh, Basura Fernando
Title: ChainReaction! Structured Approach with Causal Chains as Intermediate Representations for Improved and Explainable Causal Video Question Answering
Abstract:
Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular framework that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that produces answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating high-quality causal chains from existing datasets using large language models. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

Authors:Patryk Będkowski, Jan Dubiński, Filip Szatkowski, Kamil Deja, Przemysław Rokita, Tomasz Trzciński
Title: ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
Abstract:
Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. Such simulations are currently performed with statistical Monte Carlo methods, which are computationally expensive and put a significant strain on CERN's computational grid. Therefore, recent proposals advocate for generative machine learning methods to enable more efficient simulations. However, the distribution of the data varies significantly across the simulations, which is hard to capture with out-of-the-box methods. In this study, we present ExpertSim - a deep learning simulation approach tailored for the Zero Degree Calorimeter in the ALICE experiment. Our method utilizes a Mixture-of-Generative-Experts architecture, where each expert specializes in simulating a different subset of the data. This allows for a more precise and efficient generation process, as each expert focuses on a specific aspect of the calorimeter response. ExpertSim not only improves accuracy, but also provides a significant speedup compared to the traditional Monte-Carlo methods, offering a promising solution for high-efficiency detector simulations in particle physics experiments at CERN. We make the code available at https://github.com/patrick-bedkowski/expertsim-mix-of-generative-experts.
English: ExpertSim introduces a specialized deep learning approach using a Mixture-of-Generative-Experts architecture to enhance the accuracy and efficiency of simulating detector responses in CERN's ALICE experiment, outperforming traditional Monte Carlo methods.

Authors:Chenfan Qu, Yiwu Zhong, Bin Li, Lianwen Jin
Title: Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation
Abstract:
Images manipulated using image editing tools can mislead viewers and pose significant risks to social security. However, accurately localizing the manipulated regions within an image remains a challenging problem. One of the main barriers in this area is the high cost of data acquisition and the severe lack of high-quality annotated datasets. To address this challenge, we introduce novel methods that mitigate data scarcity by leveraging readily available web data. We utilize a large collection of manually forged images from the web, as well as automatically generated annotations derived from a simpler auxiliary task, constrained image manipulation localization. Specifically, we introduce a new paradigm, CAAA v2, which automatically and accurately annotates manipulated regions at the pixel level. To further improve annotation quality, we propose a novel metric, QES, which filters out unreliable annotations. Through CAAA v2 and QES, we construct MIMLv2, a large-scale, diverse, and high-quality dataset containing 246,212 manually forged images with pixel-level mask annotations. This is over 120x larger than existing handcrafted datasets like IMD20. Additionally, we introduce Object Jitter, a technique that further enhances model training by generating high-quality manipulation artifacts. Building on these advances, we develop a new model, Web-IML, designed to effectively leverage web-scale supervision for the image manipulation localization task. Extensive experiments demonstrate that our approach substantially alleviates the data scarcity problem and significantly improves the performance of various models on multiple real-world forgery benchmarks. With the proposed web supervision, Web-IML achieves a striking performance gain of 31% and surpasses the previous SOTA TruFor by 24.1 average IoU points. The dataset and code will be made publicly available at https://github.com/qcf-568/MIML.
English Summary: This study introduces innovative methods to overcome data scarcity in image manipulation localization by utilizing web data and automatic annotation techniques, resulting in a large-scale, high-quality dataset and a model that significantly outperforms previous state-of-the-art methods on real-world forgery benchmarks.

Authors:Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang
Title: DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes
Abstract:
We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io

Authors:Enrico Martini, Ho Jin Choi, Nadia Figueroa, Nicola Bombieri
Title: COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans
Abstract:
In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
English Summary: COMETH is a lightweight algorithm that enhances real-time multi-view human pose tracking by integrating kinematic constraints and convex optimization, achieving superior accuracy in industrial applications while reducing computational demands.

Authors:Yifan Gao, Haoyue Li, Feng Yuan, Xiaosong Wang, Xin Gao
Title: Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Abstract:
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
English: Dino U-Net leverages the DINOv3 foundation model with a specialized adapter and fidelity-aware projection to achieve state-of-the-art, scalable performance in medical image segmentation across diverse datasets.

Authors:Ye Zhang, Yu Zhou, Jingwen Qi, Yongbing Zhang, Simon Puettmann, Finn Wichmann, Larissa Pereira Ferreira, Lara Sichward, Julius Keyl, Sylvia Hartmann, Shuo Zhao, Hongxiao Wang, Xiaowei Xu, Jianxu Chen
Title: PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis
Abstract:
Deep learning-based automated pathological diagnosis has markedly improved diagnostic efficiency and reduced variability between observers, yet its clinical adoption remains limited by opaque model decisions and a lack of traceable rationale. To address this, recent multimodal visual reasoning architectures provide a unified framework that generates segmentation masks at the pixel level alongside semantically aligned textual explanations. By localizing lesion regions and producing expert-style diagnostic narratives, these models deliver the transparent and interpretable insights necessary for dependable AI-assisted pathology. Building on these advancements, we propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis. Given a pathological image and a textual query, PathMR generates expert-level diagnostic explanations while simultaneously predicting cell distribution patterns. To benchmark its performance, we evaluated our approach on the publicly available PathGen dataset as well as on our newly developed GADVR dataset. Extensive experiments on these two datasets demonstrate that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment. These results highlight the potential of PathMR for improving interpretability in AI-driven pathological diagnosis. The code will be publicly available at https://github.com/zhangye-zoe/PathMR.
English: PathMR is a cell-level multimodal visual reasoning framework that enhances AI-driven pathological diagnosis by generating expert-level textual explanations and precise cell segmentation, outperforming existing methods in text quality and segmentation accuracy while improving interpretability.

Authors:Tao Luo, Han Wu, Tong Yang, Dinggang Shen, Zhiming Cui
Title: Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training
Abstract:
Accurate dental caries detection from panoramic X-rays plays a pivotal role in preventing lesion progression. However, current detection methods often yield suboptimal accuracy due to subtle contrast variations and diverse lesion morphology of dental caries. In this work, inspired by the clinical workflow where dentists systematically combine whole-image screening with detailed tooth-level inspection, we present DVCTNet, a novel Dual-View Co-Training network for accurate dental caries detection. Our DVCTNet starts with employing automated tooth detection to establish two complementary views: a global view from panoramic X-ray images and a local view from cropped tooth images. We then pretrain two vision foundation models separately on the two views. The global-view foundation model serves as the detection backbone, generating region proposals and global features, while the local-view model extracts detailed features from corresponding cropped tooth patches matched by the region proposals. To effectively integrate information from both views, we introduce a Gated Cross-View Attention (GCV-Atten) module that dynamically fuses dual-view features, enhancing the detection pipeline by integrating the fused features back into the detection model for final caries detection. To rigorously evaluate our DVCTNet, we test it on a public dataset and further validate its performance on a newly curated, high-precision dental caries detection dataset, annotated using both intra-oral images and panoramic X-rays for double verification. Experimental results demonstrate DVCTNet's superior performance against existing state-of-the-art (SOTA) methods on both datasets, indicating the clinical applicability of our method. Our code and labeled dataset are available at https://github.com/ShanghaiTech-IMPACT/DVCTNet.
English: This study introduces DVCTNet, a dual-view co-training network that enhances dental caries detection accuracy by integrating global panoramic and local tooth-level views through a gated cross-view attention module, outperforming existing methods on public and newly curated datasets.
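A much-simplified stand-in for the gated dual-view fusion is a learned gate over concatenated global and local features, as sketched below; the actual GCV-Atten module also involves cross-view attention, which is omitted here.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a global-view feature with a local-view feature using a learned gate.

    This is a simplified illustration of gated dual-view fusion, not DVCTNet's full
    Gated Cross-View Attention module.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        # global_feat, local_feat: (B, D) pooled features for one region proposal
        g = self.gate(torch.cat([global_feat, local_feat], dim=-1))
        return g * global_feat + (1 - g) * local_feat   # per-channel convex combination
```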

Authors:Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis
Title: Evaluating Compositional Generalisation in VLMs and Diffusion Models
Abstract:
A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a "bag-of-words" and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip
English: Vision-language models like CLIP struggle with compositional semantics, often failing to correctly bind attributes and objects, while the Diffusion Classifier shows improved performance in concept binding but still faces challenges with relational reasoning.
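The Diffusion Classifier referenced here scores each candidate label by how well a text-conditioned diffusion model denoises the image under that label's prompt; a minimal version is sketched below, where `noise_pred` is a placeholder for the epsilon-prediction network, and the timestep subset and Monte Carlo noise sampling are simplifications.

```python
import torch

@torch.no_grad()
def diffusion_classify(image_latent, class_prompts, noise_pred, timesteps, alphas_cumprod) -> int:
    """Pick the class whose prompt yields the lowest average denoising error.

    noise_pred(noisy_latent, t, prompt) is a placeholder for an epsilon-prediction model;
    alphas_cumprod holds the noise scheduler's cumulative alpha values.
    """
    errors = []
    for prompt in class_prompts:
        errs = []
        for t in timesteps:
            noise = torch.randn_like(image_latent)
            a = alphas_cumprod[t]
            noisy = a.sqrt() * image_latent + (1 - a).sqrt() * noise   # forward diffusion at step t
            errs.append(torch.mean((noise_pred(noisy, t, prompt) - noise) ** 2))
        errors.append(torch.stack(errs).mean())
    return int(torch.stack(errors).argmin())   # index of the best-matching class prompt
```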

Authors:Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
Title: SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
Abstract:
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, and contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details during the conversion of 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, which advances 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
中文: SeqVLM是一种新颖的零样本三维视觉定位框架,通过多视角空间信息和动态调度机制克服单视图局限,在基准测试中实现最优性能,推动了实际应用的发展。
English: SeqVLM is a novel zero-shot 3D visual grounding framework that leverages multi-view images with spatial information and dynamic scheduling to overcome single-view limitations, achieving state-of-the-art performance on benchmarks and advancing real-world applicability.
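Editor's note: as a rough illustration of the proposal-guided multi-view projection idea, the sketch below projects a proposal's 3D points into each view with a pinhole camera model and ranks views by how many points remain visible; the exact filtering and prompt construction used by SeqVLM are not reproduced here, and all shapes are illustrative.

```python
import numpy as np

def project_points(points_w, K, T_cw):
    """Project Nx3 world points into pixels given intrinsics K (3x3)
    and world-to-camera extrinsics T_cw (4x4)."""
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)  # Nx4
    pts_c = (T_cw @ pts_h.T).T[:, :3]                    # camera-frame coordinates
    in_front = pts_c[:, 2] > 0.1                         # keep points in front of the camera
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, in_front

def rank_views(proposal_points, views, image_size=(968, 1296)):
    """Score each view by how many proposal points land inside the image."""
    H, W = image_size
    scores = []
    for K, T_cw in views:                                # one (K, T_cw) pair per frame
        uv, in_front = project_points(proposal_points, K, T_cw)
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        scores.append(int((inside & in_front).sum()))
    return np.argsort(scores)[::-1]                      # best views first
```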

Authors:Yuxi Hu, Jun Zhang, Kuangyi Chen, Zhe Zhang, Friedrich Fraundorfer
Title: ${C}^{3}$-GS: Learning Context-aware, Cross-dimension, Cross-scale Feature for Generalizable Gaussian Splatting
Abstract:
Generalizable Gaussian Splatting aims to synthesize novel views for unseen scenes without per-scene optimization. In particular, recent advancements utilize feed-forward networks to predict per-pixel Gaussian parameters, enabling high-quality synthesis from sparse input views. However, existing approaches fall short in encoding discriminative, multi-view consistent features for Gaussian predictions, and thus struggle to construct accurate geometry from sparse views. To address this, we propose $\mathbf{C}^{3}$-GS, a framework that enhances feature learning by incorporating context-aware, cross-dimension, and cross-scale constraints. Our architecture integrates three lightweight modules into a unified rendering pipeline, improving feature fusion and enabling photorealistic synthesis without requiring additional supervision. Extensive experiments on benchmark datasets validate that $\mathbf{C}^{3}$-GS achieves state-of-the-art rendering quality and generalization ability. Code is available at: https://github.com/YuhsiHu/C3-GS.
中文: C³-GS框架通过引入上下文感知、跨维度和跨尺度的约束来增强高斯泼溅的特征学习能力,无需逐场景优化即可实现照片级真实感的新视角合成,在渲染质量和泛化能力上达到领先水平。
English: The proposed C³-GS framework enhances Gaussian Splatting by incorporating context-aware, cross-dimension, and cross-scale constraints to improve feature learning and enable photorealistic novel view synthesis without per-scene optimization.

Authors:Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Title: Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Abstract:
Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods using pointwise reward models (RM) for scoring generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open and closed-source T2I models and validate the effectiveness of Pref-GRPO.
中文摘要:本文提出Pref-GRPO方法,通过将优化目标从分数最大化转为偏好匹配来解决文本生成图像中的奖励破解问题,同时开发了具有细粒度评估标准的UniGenBench基准,以更全面评估模型性能。
English Summary: This paper introduces Pref-GRPO, a pairwise preference-based reinforcement learning method that mitigates reward hacking in text-to-image generation by shifting optimization from score maximization to preference fitting, alongside UniGenBench, a comprehensive benchmark with fine-grained evaluation criteria to better assess model performance.
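Editor's note: a minimal sketch of the pairwise win-rate reward described above: each image in a group is compared against every other image with a preference model, and its win rate replaces a pointwise score. `preference_model` is a hypothetical callable returning the probability that the first image is preferred; the normalization step mirrors common GRPO practice rather than the paper's exact recipe.

```python
import torch

def win_rate_rewards(images, prompt, preference_model):
    """Compute each image's reward as its average win rate within the group.

    images:           list of generated images for the same prompt
    preference_model: callable (prompt, img_a, img_b) -> P(img_a preferred), hypothetical
    """
    n = len(images)
    wins = torch.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p_ij = preference_model(prompt, images[i], images[j])
            wins[i] += float(p_ij > 0.5)                 # count pairwise wins
    rewards = wins / (n - 1)                             # win rate in [0, 1]
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-normalized advantage
    return rewards, advantages
```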

Authors:Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari
Title: MobileCLIP2: Improving Multi-Modal Reinforced Training
Abstract:
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\times$ smaller and improves on DFN ViT-L/14 at 2.5$\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
中文: MobileCLIP2通过增强的多模态强化训练和改进的教师模型集成,以低延迟和小模型尺寸实现了最先进的零样本准确率。
English: MobileCLIP2 introduces enhanced multi-modal reinforced training and improved teacher ensembles, achieving state-of-the-art zero-shot accuracy with low latency and smaller model sizes.
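Editor's note: the abstract highlights temperature tuning in contrastive knowledge distillation; a generic sketch of such a loss is shown below, matching the student's image-text similarity distribution to the teacher's with a KL divergence. The temperature value and exact loss formulation are illustrative assumptions, not taken from MobileCLIP2.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_img, student_txt, teacher_img, teacher_txt, tau=2.0):
    """Distill the teacher's image-to-text similarity distribution into the student.

    *_img, *_txt: L2-normalized embedding batches of shape (B, D)
    tau:          distillation temperature (a key hyperparameter per the abstract)
    """
    s_logits = student_img @ student_txt.t() / tau        # (B, B) student similarities
    t_logits = teacher_img @ teacher_txt.t() / tau        # (B, B) teacher similarities
    loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean") * tau * tau     # standard KD temperature scaling
    return loss
```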

Authors:Smriti Joshi, Lidia Garrucho, Richard Osuala, Oliver Diaz, Karim Lekadir
Title: Mask-Guided Multi-Channel SwinUNETR Framework for Robust MRI Classification
Abstract:
Breast cancer is one of the leading causes of cancer-related mortality in women, and early detection is essential for improving outcomes. Magnetic resonance imaging (MRI) is a highly sensitive tool for breast cancer detection, particularly in women at high risk or with dense breast tissue, where mammography is less effective. The ODELIA consortium organized a multi-center challenge to foster AI-based solutions for breast cancer diagnosis and classification. The dataset included 511 studies from six European centers, acquired on scanners from multiple vendors at both 1.5 T and 3 T. Each study was labeled for the left and right breast as no lesion, benign lesion, or malignant lesion. We developed a SwinUNETR-based deep learning framework that incorporates breast region masking, extensive data augmentation, and ensemble learning to improve robustness and generalizability. Our method achieved second place on the challenge leaderboard, highlighting its potential to support clinical breast MRI interpretation. We publicly share our codebase at https://github.com/smriti-joshi/bcnaim-odelia-challenge.git.
中文: ODELIA联盟通过多中心挑战赛推动乳腺癌诊断的AI解决方案,我们基于SwinUNETR的框架结合乳腺区域掩模和集成学习,在挑战赛中荣获第二名,展现了临床应用的潜力。
English: The ODELIA consortium's multi-center challenge promoted AI solutions for breast cancer diagnosis, where our SwinUNETR-based framework secured second place by enhancing detection robustness through breast masking and ensemble learning.

Authors:Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun
Title: EmoCAST: Emotional Talking Portrait via Emotive Text Description
Abstract:
Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are primarily collected in lab settings, further exacerbating these shortcomings. Consequently, these limitations substantially hinder practical applications in real-world scenarios. To address these challenges, we propose EmoCAST, a diffusion-based framework with two key modules for precise text-driven emotional synthesis. In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module, enhancing the spatial knowledge to improve emotion comprehension. To improve the relationship between audio and emotion, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide more precise facial motion synthesis. Additionally, we construct an emotional talking head dataset with comprehensive emotive text descriptions to optimize the framework's performance. Based on the proposed dataset, we propose an emotion-aware sampling training strategy and a progressive functional training strategy that further improve the model's ability to capture nuanced expressive features and achieve accurate lip-synchronization. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST
中文摘要:EmoCAST是一种基于扩散的框架,通过文本引导模块和新构建的数据集,显著提升了情感说话头合成的真实感和音视频同步性能。
English Summary: EmoCAST is a diffusion-based framework that enhances emotional talking head synthesis through text-guided modules and a new dataset, achieving superior realism and synchronization.

Authors:Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Title: Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation
Abstract:
Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: \href{https://github.com/Hiyoochan/mmActS}{mmActS}.
中文: 本研究提出了一种主动序贯域自适应框架,通过优先选择信息量和代表性最强的多模态医学数据样本,在显著降低标注成本的同时,相比现有方法大幅提升了肿瘤靶区勾画的性能表现。
English: This study introduces an active and sequential domain adaptation framework that optimizes multi-modal medical data selection by prioritizing the most informative and representative samples, significantly enhancing gross tumor volume segmentation performance while reducing annotation costs compared to existing methods.
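Editor's note: a toy version of the informativeness-plus-representativeness query strategy above is given below; the concrete definitions (prediction entropy for informativeness, feature similarity to the unlabeled pool for representativeness) and the weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def query_scores(probs, features, lam=0.5):
    """Rank unlabeled samples by informativeness + representativeness.

    probs:    (N, C, ...) softmax predictions for N unlabeled samples
    features: (N, D) pooled feature vectors for the same samples
    lam:      trade-off weight (assumed value)
    """
    # Informativeness: mean voxel/pixel-wise prediction entropy.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    informativeness = entropy.reshape(len(entropy), -1).mean(dim=1)

    # Representativeness: average cosine similarity to the rest of the pool.
    f = F.normalize(features, dim=1)
    sim = f @ f.t()
    representativeness = (sim.sum(dim=1) - 1.0) / (len(f) - 1)

    score = lam * informativeness + (1 - lam) * representativeness
    return score.argsort(descending=True)                # indices to label first
```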

Authors:Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro
Title: Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding
Abstract:
Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and AVSR. Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
Chinese: 本文提出了首个统一框架,将手语、唇部动作和音频结合用于口语文本生成,在多项任务中达到或优于最先进模型的性能,并揭示了唇部动作作为独立模态对提升手语翻译效果的关键作用。
English: This paper introduces the first unified framework that integrates sign language, lip movements, and audio for spoken-language text generation, achieving performance comparable to or better than state-of-the-art models in multiple tasks while revealing the critical role of lip movements in enhancing sign language translation.

Authors:Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan
Title: Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Abstract:
Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.

Authors:Pengpeng Yu, Haoran Li, Dingquan Li, Runqing Jiang, Jing Wang, Liang Lin, Yulan Guo
Title: Re-Densification Meets Cross-Scale Propagation: Real-Time Compression of LiDAR Point Clouds
Abstract:
LiDAR point clouds are fundamental to various applications, yet high-precision scans incur substantial storage and transmission overhead. Existing methods typically convert unordered points into hierarchical octree or voxel structures for dense-to-sparse predictive coding. However, the extreme sparsity of geometric details hinders efficient context modeling, thereby limiting their compression performance and speed. To address this challenge, we propose to generate compact features for efficient predictive coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module re-densifies encoded sparse geometry, extracts features at denser scale, and then re-sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation. This design facilitates information sharing across scales, thereby reducing redundant feature extraction and providing enriched features for the Geometry Re-Densification Module. By integrating these two modules, our method yields a compact feature representation that provides efficient context modeling and accelerates the coding process. Experiments on the KITTI dataset demonstrate state-of-the-art compression ratios and real-time performance, achieving 26 FPS for both encoding and decoding at 12-bit quantization. Code is available at https://github.com/pengpeng-yu/FastPCC.
中文摘要:该方法通过两个轻量级模块生成紧凑特征来实现高效激光雷达点云压缩,在KITTI数据集上实现了最优压缩性能并具备实时处理能力。
English Summary: The proposed method introduces two lightweight modules that generate compact features for efficient LiDAR point cloud compression, achieving state-of-the-art performance with real-time processing speeds on the KITTI dataset.

Authors:Yuqi Xiong, Wuzhen Shi, Yang Wen, Ruhan Liu
Title: Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection
Abstract:
In view of the problems that existing salient object detection (SOD) methods are prone to losing details, blurring edges, and insufficient fusion of single-modal information in complex scenes, this paper proposes a dynamic uncertainty propagation and multimodal collaborative reasoning network (DUP-MCRNet). Firstly, a dynamic uncertainty graph convolution module (DUGC) is designed to propagate uncertainty between layers through a sparse graph constructed based on spatial semantic distance, and combined with channel adaptive interaction, it effectively improves the detection accuracy of small structures and edge regions. Secondly, a multimodal collaborative fusion strategy (MCF) is proposed, which uses learnable modality gating weights to weightedly fuse the attention maps of RGB, depth, and edge features. It can dynamically adjust the importance of each modality according to different scenes, effectively suppress redundant or interfering information, and strengthen the semantic complementarity and consistency between cross-modalities, thereby improving the ability to identify salient regions under occlusion, weak texture or background interference. Finally, the detection performance at the pixel level and region level is optimized through multi-scale BCE and IoU loss, cross-scale consistency constraints, and uncertainty-guided supervision mechanisms. Extensive experiments show that DUP-MCRNet outperforms various SOD methods on most common benchmark datasets, especially in terms of edge clarity and robustness to complex backgrounds. Our code is publicly available at https://github.com/YukiBear426/DUP-MCRNet.
Chinese: 本文提出DUP-MCRNet网络,通过动态不确定性图卷积传播和多模态协同融合策略,有效提升显著目标检测在边缘细节和复杂场景下的性能表现,在多个基准数据集上验证了其优越性。
English: This paper introduces DUP-MCRNet, a novel network that enhances salient object detection by dynamically propagating uncertainty through graph convolution and adaptively fusing multimodal features, achieving superior performance in edge clarity and complex scene robustness.
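Editor's note: the multimodal collaborative fusion strategy above amounts to gating attention maps from RGB, depth, and edge branches with learnable, input-conditioned weights; a minimal PyTorch sketch of that idea follows, with the module name and tensor shapes chosen for illustration only.

```python
import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    """Fuse per-modality attention maps with learnable, input-conditioned gates."""

    def __init__(self, channels, num_modalities=3):
        super().__init__()
        # One gating weight per modality, predicted from globally pooled features.
        self.gate = nn.Sequential(
            nn.Linear(channels * num_modalities, num_modalities),
            nn.Softmax(dim=-1),
        )

    def forward(self, attn_maps):
        # attn_maps: list of (B, C, H, W) attention maps, one per modality.
        pooled = torch.cat([a.mean(dim=(2, 3)) for a in attn_maps], dim=1)  # (B, C*M)
        w = self.gate(pooled)                                               # (B, M)
        stacked = torch.stack(attn_maps, dim=1)                             # (B, M, C, H, W)
        return (w[:, :, None, None, None] * stacked).sum(dim=1)             # (B, C, H, W)

# Example: fuse RGB, depth, and edge attention maps of shape (2, 64, 56, 56).
fused = ModalityGatedFusion(64)([torch.rand(2, 64, 56, 56) for _ in range(3)])
```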

Authors:Mang Cao, Sanping Zhou, Yizhe Li, Ye Deng, Wenli Huang, Le Wang
Title: Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction
Abstract:
Sufficient cross-task interaction is crucial for success in multi-task dense prediction. However, sufficient interaction often results in high computational complexity, forcing existing methods to face the trade-off between interaction completeness and computational efficiency. To address this limitation, this work proposes a Bidirectional Interaction Mamba (BIM), which incorporates novel scanning mechanisms to adapt the Mamba modeling approach for multi-task dense prediction. On the one hand, we introduce a novel Bidirectional Interaction Scan (BI-Scan) mechanism, which constructs task-specific representations as bidirectional sequences during interaction. By integrating task-first and position-first scanning modes within a unified linear complexity architecture, BI-Scan efficiently preserves critical cross-task information. On the other hand, we employ a Multi-Scale Scan (MS-Scan) mechanism to achieve multi-granularity scene modeling. This design not only meets the diverse granularity requirements of various tasks but also enhances nuanced cross-task feature interactions. Extensive experiments on two challenging benchmarks, i.e., NYUD-V2 and PASCAL-Context, show the superiority of our BIM over its state-of-the-art competitors.
中文: 本研究提出双向交互Mamba(BIM),通过新型扫描机制在多任务密集预测中实现高效的跨任务交互,同时保持线性计算复杂度。
English: This work introduces Bidirectional Interaction Mamba (BIM), which uses novel scanning mechanisms to achieve efficient cross-task interaction in multi-task dense prediction while maintaining linear computational complexity.
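Editor's note: purely as a data-layout illustration, task-first and position-first orderings of a set of task-specific token maps are just two flattenings of the same tensor, as sketched below; the actual Mamba blocks and interaction design of BIM are not reproduced.

```python
import torch

def bi_scan_sequences(tokens):
    """Build task-first and position-first 1D sequences from task token maps.

    tokens: (T, N, C) tensor with T tasks, N spatial positions, C channels.
    Returns two sequences of length T*N that traverse the same tokens in
    different orders, mimicking the two scanning modes.
    """
    T, N, C = tokens.shape
    task_first = tokens.reshape(T * N, C)                      # all positions of task 0, then task 1, ...
    position_first = tokens.transpose(0, 1).reshape(N * T, C)  # position 0 across tasks, then position 1, ...
    return task_first, position_first

seqs = bi_scan_sequences(torch.rand(2, 196, 64))               # e.g. 2 tasks, 14x14 tokens
```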

Authors:Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis
Title: Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Abstract:
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
中文: 本文提出了一种无需训练的自适应框架,通过将基于输出的补丁级语义一致性反馈至中间注意力层,有效提升了CLIP在开放词汇分割中的空间连贯性,在多种基准测试中均实现性能提升且无需修改模型结构。
English: This paper introduces a training-free, self-adaptive framework that enhances CLIP's open-vocabulary segmentation by feeding back output-based patch-level semantic coherence to intermediate attention, improving spatial consistency and performance across multiple benchmarks without altering model architecture.

Authors:Mert Cokelek, Halit Ozsoy, Nevrez Imamoglu, Cagri Ozcinar, Inci Ayhan, Erkut Erdem, Aykut Erdem
Title: Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360-Degree Videos
Abstract:
Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360-degree environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360-degree audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360-degree videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360-degree scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos. Code and dataset will be available at https://cyberiada.github.io/SalViT360.
中文: 本研究提出了两种新颖的视听显著性预测模型SalViT360和SalViT360-AV,通过结合球面几何感知注意力和音频适配器,在360度视频中显著提升了观众注意力预测的准确性,并在新构建的YT360-EyeTracking数据集上验证了其优越性能。
English: This study introduces two novel audio-visual saliency prediction models, SalViT360 and SalViT360-AV, which leverage spherical geometry-aware attention and audio transformer adapters to significantly outperform existing methods in predicting viewer attention in 360-degree videos, as validated on the newly curated YT360-EyeTracking dataset.

Authors:Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Title: Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization
Abstract:
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks, ranging from NLP to computer vision. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate, that is to generate answers to the user's query that are not reflected in the visual input. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations. In contrast to recent approaches that require complicated pipelines to build synthetic preference data for alignment training, often relying on proprietary models, we capitalize on the well-known CHAIR metric, originally proposed to gauge the degree of hallucinations in image captioning. Given a pair of generated answers, we leverage CHAIR to distinguish winner and loser options (i.e., non-hallucinated and hallucinated samples) and fine-tune off-the-shelf MLLMs via Direct Preference Optimization (DPO). The resulting method, which we refer to as CHAIR-DPO, effectively diminishes the amount of hallucinated answers on several hallucination benchmarks, demonstrating the effectiveness of fine-tuning the MLLM with a CHAIR-based reward. Source code and trained models are publicly available at https://github.com/aimagelab/CHAIR-DPO.
中文摘要:CHAIR-DPO方法通过CHAIR指标区分非幻觉与幻觉样本,并利用直接偏好优化微调多模态大语言模型,在多个基准测试中显著减少了幻觉答案的生成。
English Summary: CHAIR-DPO addresses hallucinations in Multimodal Large Language Models by using the CHAIR metric to identify non-hallucinated responses and fine-tuning models with Direct Preference Optimization, effectively reducing errors across multiple benchmarks.
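Editor's note: a small sketch of how a CHAIR-style object-hallucination count can be used to pick winner and loser answers for DPO, as described above; the object extraction here is a naive substring match that only stands in for the real CHAIR pipeline, and all names and inputs are illustrative.

```python
def chair_score(caption, gt_objects, vocabulary):
    """Count mentioned objects that are absent from the ground-truth object set."""
    mentioned = {obj for obj in vocabulary if obj in caption.lower()}
    hallucinated = mentioned - set(gt_objects)
    return len(hallucinated), len(mentioned)

def build_preference_pair(answer_a, answer_b, gt_objects, vocabulary):
    """Return (winner, loser) for DPO, preferring the less-hallucinated answer."""
    h_a, _ = chair_score(answer_a, gt_objects, vocabulary)
    h_b, _ = chair_score(answer_b, gt_objects, vocabulary)
    if h_a == h_b:
        return None                        # no usable preference signal for this pair
    return (answer_a, answer_b) if h_a < h_b else (answer_b, answer_a)

# Toy example with a hypothetical object vocabulary.
pair = build_preference_pair(
    "a dog sitting next to a bicycle",
    "a dog sitting next to a bicycle and a frisbee",
    gt_objects=["dog", "bicycle"],
    vocabulary=["dog", "bicycle", "frisbee", "person"],
)
```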

Authors:Guoping Xu, Jayaram K. Udupa, Jax Luo, Songlin Zhao, Yajun Yu, Scott B. Raymond, Hao Peng, Lipeng Ning, Yogesh Rathi, Wei Liu, You Zhang
Title: Is the medical image segmentation problem solved? A survey of current developments and future directions
Abstract:
Medical image segmentation has advanced rapidly over the past two decades, largely driven by deep learning, which has enabled accurate and efficient delineation of cells, tissues, organs, and pathologies across diverse imaging modalities. This progress raises a fundamental question: to what extent have current models overcome persistent challenges, and what gaps remain? In this work, we provide an in-depth review of medical image segmentation, tracing its progress and key developments over the past decade. We examine core principles, including multiscale analysis, attention mechanisms, and the integration of prior knowledge, across the encoder, bottleneck, skip connections, and decoder components of segmentation networks. Our discussion is organized around seven key dimensions: (1) the shift from supervised to semi-/unsupervised learning, (2) the transition from organ segmentation to lesion-focused tasks, (3) advances in multi-modality integration and domain adaptation, (4) the role of foundation models and transfer learning, (5) the move from deterministic to probabilistic segmentation, (6) the progression from 2D to 3D and 4D segmentation, and (7) the trend from model invocation to segmentation agents. Together, these perspectives provide a holistic overview of the trajectory of deep learning-based medical image segmentation and aim to inspire future innovation. To support ongoing research, we maintain a continually updated repository of relevant literature and open-source resources at https://github.com/apple1986/medicalSegReview
中文摘要:本文全面回顾了过去十年医学图像分割的发展历程,从七个关键维度分析了技术演进,并指出了当前挑战与未来研究方向。
English Summary: This review comprehensively examines the evolution of medical image segmentation over the past decade, analyzing key technical developments across seven critical dimensions while identifying remaining challenges and future directions.

Authors:Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, Jiaqi Wang
Title: CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
Abstract:
Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
中文: CODA提出了一种可训练的复合框架,通过两阶段训练流程将通用规划器与专业执行器相结合,在科学计算GUI任务中实现了卓越的执行鲁棒性和跨领域泛化能力。
English: CODA introduces a trainable compositional framework that combines a generalist planner with specialist executors, achieving superior performance in scientific GUI tasks through a two-stage training pipeline for robust execution and cross-domain generalization.

Authors:Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan
Title: AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Abstract:
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory
Chinese: AudioStory是一个将大语言模型与文本到音频系统相结合的统一框架,通过将复杂叙事查询分解为有序子任务来生成连贯的长篇音频,在指令遵循能力和音频保真度上均优于现有基线。
English: AudioStory is a unified framework that integrates large language models with text-to-audio systems to generate coherent long-form narratives by decomposing queries into structured sub-tasks, outperforming existing methods in both instruction-following and audio quality.

Authors:Manogna Sreenivas, Soma Biswas
Title: Segmentation Assisted Incremental Test Time Adaptation in an Open World
Abstract:
In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of the deployed trained models. This work addresses Incremental Test Time Adaptation (ITTA) of Vision Language Models (VLMs), tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation assisted active labeling module, termed SegAssist, which is training free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/

Authors:Gianluca Guzzetta
Title: Reimagining Image Segmentation using Active Contour: From Chan Vese Algorithm into a Proposal Novel Functional Loss Framework
Abstract:
In this paper, we present a comprehensive study and analysis of the Chan-Vese algorithm for image segmentation. We employ a discretized scheme derived from the empirical study of the Chan-Vese model's functional energy and its partial differential equation based on its level set function. We provide a proof of the results and an implementation using MATLAB. Leveraging modern computer vision methodologies, we propose a functional segmentation loss based on active contours, utilizing pytorch.nn.ModuleLoss and a level set based on the Chan-Vese algorithm. We compare our results with common computer vision segmentation datasets and evaluate the performance of classical loss functions against our proposed method. All code and materials used are available at https://github.com/gguzzy/chan_vese_functional_loss.
中文: 本文对Chan-Vese算法进行了全面研究,提出了一种基于活动轮廓的功能分割损失方法,并在标准数据集上与经典方法进行了性能比较。
English: This paper conducts a comprehensive study of the Chan-Vese algorithm, proposing a functional segmentation loss using active contours and comparing its performance with classical methods on standard datasets.
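Editor's note: since the paper proposes a Chan-Vese-style functional loss in PyTorch, here is a generic sketch of such a loss for soft segmentation masks; it keeps only the two region-variance terms of the Chan-Vese energy and omits the length and area terms, so it is a simplification rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ChanVeseLoss(nn.Module):
    """Region-based active-contour loss: penalize intensity variance inside and
    outside the predicted (soft) foreground region, as in the Chan-Vese energy."""

    def __init__(self, lambda_in=1.0, lambda_out=1.0):
        super().__init__()
        self.lambda_in = lambda_in
        self.lambda_out = lambda_out

    def forward(self, mask, image):
        # mask:  (B, 1, H, W) soft foreground probabilities in [0, 1]
        # image: (B, 1, H, W) single-channel image
        eps = 1e-6
        c_in = (mask * image).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + eps)
        c_out = ((1 - mask) * image).sum(dim=(2, 3)) / ((1 - mask).sum(dim=(2, 3)) + eps)
        c_in = c_in[:, :, None, None]                    # region mean inside the contour
        c_out = c_out[:, :, None, None]                  # region mean outside the contour
        e_in = (mask * (image - c_in) ** 2).mean()
        e_out = ((1 - mask) * (image - c_out) ** 2).mean()
        return self.lambda_in * e_in + self.lambda_out * e_out

loss = ChanVeseLoss()(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```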

Authors:Taebaek Hwang, Minseo Kim, Gisang Lee, Seonuk Kim, Hyunjun Eun
Title: KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
Abstract:
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
中文:该摘要介绍了KRETA,一个针对韩语文本丰富视觉问答的评估基准,旨在弥补低资源语言资源不足的问题,并采用半自动化流程确保高质量数据生成。
English: This abstract introduces KRETA, a benchmark for evaluating Korean text-rich visual question answering, addressing the lack of resources for low-resource languages and featuring a semi-automated pipeline for high-quality data generation.

Authors:Moussa Kassem Sbeyti, Nadja Klein, Michelle Karg, Christian Wirth, Sahin Albayrak
Title: Streamlining the Development of Active Learning Methods in Real-World Object Detection
Abstract:
Active learning (AL) for real-world object detection faces computational and reliability challenges that limit practical deployment. Developing new AL methods requires training multiple detectors across iterations to compare against existing approaches. This creates high costs for autonomous driving datasets where the training of one detector requires up to 282 GPU hours. Additionally, AL method rankings vary substantially across validation sets, compromising reliability in safety-critical transportation systems. We introduce object-based set similarity ($\mathrm{OSS}$), a metric that addresses these challenges. $\mathrm{OSS}$ (1) quantifies AL method effectiveness without requiring detector training by measuring similarity between training sets and target domains using object-level features. This enables the elimination of ineffective AL methods before training. Furthermore, $\mathrm{OSS}$ (2) enables the selection of representative validation sets for robust evaluation. We validate our similarity-based approach on three autonomous driving datasets (KITTI, BDD100K, CODA) using uncertainty-based AL methods as a case study with two detector architectures (EfficientDet, YOLOv3). This work is the first to unify AL training and evaluation strategies in object detection based on object similarity. $\mathrm{OSS}$ is detector-agnostic, requires only labeled object crops, and integrates with existing AL pipelines. This provides a practical framework for deploying AL in real-world applications where computational efficiency and evaluation reliability are critical. Code is available at https://mos-ks.github.io/publications/.
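Editor's note: purely as an illustration of the idea of comparing training sets and target domains through object-level features, a set-to-set similarity could look like the sketch below; the actual definition of OSS in the paper may differ, and the feature dimensions are arbitrary.

```python
import torch
import torch.nn.functional as F

def object_set_similarity(train_feats, target_feats):
    """Similarity between two sets of object-crop embeddings.

    train_feats, target_feats: (N, D) and (M, D) feature matrices.
    For every target object we take its best match in the training set and
    average, so the score is high when the training set covers the target
    domain's objects well.
    """
    a = F.normalize(train_feats, dim=1)
    b = F.normalize(target_feats, dim=1)
    sim = b @ a.t()                        # (M, N) cosine similarities
    return sim.max(dim=1).values.mean().item()

score = object_set_similarity(torch.randn(500, 256), torch.randn(200, 256))
```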

Authors:Long Chen, Ashiv Patel, Mengyun Qiao, Mohammad Yousuf Salmasi, Salah A. Hammouche, Vasilis Stavrinides, Jasleen Nagi, Soodeh Kalaie, Xiao Yun Xu, Wenjia Bai, Declan P. O'Regan
Title: Multimodal Conditional MeshGAN for Personalized Aneurysm Growth Prediction
Abstract:
Personalized, accurate prediction of aortic aneurysm progression is essential for timely intervention but remains challenging due to the need to model both subtle local deformations and global anatomical changes within complex 3D geometries. We propose MCMeshGAN, the first multimodal conditional mesh-to-mesh generative adversarial network for 3D aneurysm growth prediction. MCMeshGAN introduces a dual-branch architecture combining a novel local KNN-based convolutional network (KCN) to preserve fine-grained geometric details and a global graph convolutional network (GCN) to capture long-range structural context, overcoming the over-smoothing limitations of deep GCNs. A dedicated condition branch encodes clinical attributes (age, sex) and the target time interval to generate anatomically plausible, temporally controlled predictions, enabling retrospective and prospective modeling. We curated TAAMesh, a new longitudinal thoracic aortic aneurysm mesh dataset consisting of 590 multimodal records (CT scans, 3D meshes, and clinical data) from 208 patients. Extensive experiments demonstrate that MCMeshGAN consistently outperforms state-of-the-art baselines in both geometric accuracy and clinically important diameter estimation. This framework offers a robust step toward clinically deployable, personalized 3D disease trajectory modeling. The source code for MCMeshGAN and the baseline methods is publicly available at https://github.com/ImperialCollegeLondon/MCMeshGAN.
中文: MCMeshGAN是一种新型多模态条件生成对抗网络,通过结合局部几何细节、全局结构背景和临床数据,精准预测3D主动脉瘤进展,其性能显著优于现有方法。
English: MCMeshGAN is a novel multimodal conditional generative adversarial network that accurately predicts 3D aortic aneurysm progression by integrating local geometric details and global structural context with clinical data, demonstrating superior performance over existing methods.

Authors:Xiaoqi Wang, Yun Zhang, Weisi Lin
Title: Image Quality Assessment for Machines: Paradigm, Large-scale Database, and Models
Abstract:
Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. To address this, we propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance. We establish an MIQA paradigm encompassing the end-to-end assessment workflow. To support this, we construct a machine-centric image quality database (MIQD-2.5M), comprising 2.5 million samples that capture distinctive degradation responses in both consistency and accuracy metrics, spanning 75 vision models, 250 degradation types, and three representative vision tasks. We further propose a region-aware MIQA (RA-MIQA) model to evaluate MVS visual quality through fine-grained spatial degradation analysis. Extensive experiments benchmark the proposed RA-MIQA against seven human visual system (HVS)-based IQA metrics and five retrained classical backbones. Results demonstrate RA-MIQA's superior performance in multiple dimensions, e.g., achieving SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while also revealing task-specific degradation sensitivities. Critically, HVS-based metrics prove inadequate for MVS quality prediction, while even specialized MIQA models struggle with background degradations, accuracy-oriented estimation, and subtle distortions. This study can advance MVS reliability and establish foundations for machine-centric image processing and optimization. The model and code are available at: https://github.com/XiaoqiWang/MIQA.
中文摘要:本研究提出了一种以机器为中心的图像质量评估框架,通过构建包含250万样本的数据库和区域感知模型,有效量化图像退化对机器视觉系统的影响,其性能显著优于基于人类视觉的评估方法。
English Summary: This study introduces a machine-centric image quality assessment (MIQA) framework that evaluates image degradation impacts on machine vision systems, supported by a 2.5-million-sample database and a region-aware model demonstrating superior performance over human vision-based metrics.

Authors:Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos
Title: AutoQ-VIS: Improving Unsupervised Video Instance Segmentation via Automatic Quality Assessment
Abstract:
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4$\%$, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ-VIS.
中文:AutoQ-VIS提出了一种质量引导的自训练框架,通过伪标签生成与自动质量评估的闭环系统,在无监督视频实例分割中弥合了合成到真实数据的领域差距,无需人工标注即实现了最优性能。
English: AutoQ-VIS introduces a quality-guided self-training framework that bridges the synthetic-to-real domain gap in unsupervised Video Instance Segmentation, achieving state-of-the-art performance without human annotations.
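Editor's note: the closed loop between pseudo-label generation and automatic quality assessment can be sketched as below; `segment_video`, `quality_score`, and `train_on` are hypothetical stand-ins for the framework's components, and the threshold and number of rounds are illustrative.

```python
def quality_guided_self_training(model, unlabeled_videos, segment_video, quality_score,
                                 train_on, rounds=3, tau=0.7):
    """Progressively adapt from synthetic to real videos using quality-filtered pseudo-labels.

    segment_video(model, video) -> pseudo instance masks        (hypothetical)
    quality_score(video, masks) -> float in [0, 1]              (hypothetical)
    train_on(model, dataset)    -> fine-tuned model             (hypothetical)
    """
    for _ in range(rounds):
        accepted = []
        for video in unlabeled_videos:
            masks = segment_video(model, video)          # generate pseudo-labels
            if quality_score(video, masks) >= tau:       # keep only confident ones
                accepted.append((video, masks))
        if not accepted:
            break
        model = train_on(model, accepted)                # self-train on the filtered set
    return model
```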

Authors:Shay Shomer Chai, Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor
Title: Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models
Abstract:
Text-to-image generation has recently seen remarkable success, granting users the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger scale. In this work, we perform a case study on colors -- a fundamental attribute commonly associated with objects in text prompts -- which offers a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes -- far more so than with single-color prompts -- and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.

Authors:Qiyao Xu, Qiming Wu, Xiaowei Li
Title: SPLF-SAM: Self-Prompting Segment Anything Model for Light Field Salient Object Detection
Abstract:
Segment Anything Model (SAM) has demonstrated remarkable capabilities in solving light field salient object detection (LF SOD). However, most existing models tend to neglect the extraction of prompt information for this task. Meanwhile, traditional models ignore the analysis of frequency-domain information, which leads to small objects being overwhelmed by noise. In this paper, we put forward a novel model called the self-prompting light field segment anything model (SPLF-SAM), equipped with a unified multi-scale feature embedding block (UMFEB) and a multi-scale adaptive filtering adapter (MAFA). UMFEB is capable of identifying multiple objects of varying sizes, while MAFA, by learning frequency features, effectively prevents small objects from being overwhelmed by noise. Extensive experiments have demonstrated the superiority of our method over ten state-of-the-art (SOTA) LF SOD methods. Our code will be available at https://github.com/XucherCH/splfsam.
Chinese: 提出的SPLF-SAM模型通过结合自提示机制、统一多尺度特征嵌入块和多尺度自适应滤波适配器,显著提升了光场显著目标检测的性能,有效抑制噪声干扰,并在十种先进方法中表现最优。
English: The proposed SPLF-SAM model enhances light field salient object detection by integrating a self-prompting mechanism with a unified multi-scale feature embedding block and a multi-scale adaptive filtering adapter, effectively addressing noise interference and outperforming ten state-of-the-art methods.

Authors:Yupeng Zhang, Dezhi Zheng, Ping Lu, Han Zhang, Lei Wang, Liping xiang, Cheng Luo, Kaijun Deng, Xiaowen Fu, Linlin Shen, Jinbao Wang
Title: LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a novel explicit representation for 3D scenes, offering both high-fidelity reconstruction and efficient rendering. However, 3DGS lacks 3D segmentation ability, which limits its applicability in tasks that require scene understanding, where identifying and isolating specific object components is crucial. To address this limitation, we propose Label-aware 3D Gaussian Splatting (LabelGS), a method that augments the Gaussian representation with object labels. LabelGS introduces cross-view consistent semantic masks for 3D Gaussians and employs a novel Occlusion Analysis Model to avoid overfitting to occlusions during optimization, a Main Gaussian Labeling model to lift 2D semantic priors to 3D Gaussians, and a Gaussian Projection Filter to avoid Gaussian label conflicts. Our approach achieves effective decoupling of Gaussian representations and refines the 3DGS optimization process through a random region sampling strategy, significantly improving efficiency. Extensive experiments demonstrate that LabelGS outperforms previous state-of-the-art methods, including Feature-3DGS, in the 3D scene segmentation task. Notably, LabelGS achieves a remarkable 22x speedup in training compared to Feature-3DGS at a resolution of 1440x1080. Our code will be at https://github.com/garrisonz/LabelGS.
Chinese: LabelGS通过引入对象标签和创新的优化技术,显著提升了3D高斯泼溅的语义分割能力,在保持高保真重建的同时实现了22倍的训练加速。
English: LabelGS enhances 3D Gaussian Splatting by integrating object labels and novel optimization techniques, achieving superior 3D segmentation performance and a 22X training speedup over previous methods.

Authors:Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, Yongdong Zhang
Title: Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
Abstract:
Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement. The benchmark is available at https://github.com/Cola-any/Video-LevelGauge
中文摘要:Video-LevelGauge基准通过标准化测试和定制化情境设置,系统评估大型视频语言模型的位置偏差,发现开源模型存在显著偏差,而Gemini2.5-Pro等商业模型在完整视频序列中表现稳定。
English Summary: The Video-LevelGauge benchmark systematically evaluates positional bias in large video language models, revealing significant biases in open-source models while commercial models like Gemini2.5-Pro demonstrate consistent performance across video sequences.

Authors:Dongjin Kim, Jaekyun Ko, Muhammad Kashif Ali, Tae Hyun Kim
Title: IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising
Abstract:
Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning-based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources but they still suffer from overfitting. To address these issues, we conduct image denoising by utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improves resilience to unseen noise. Specifically, our method leverages a Feature Extraction Module for robust noise-invariant features, Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module then employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. Despite being trained on single-level Gaussian noise, our compact model (~ 0.04 M) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.
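Editor's note: the core operation implied above, applying a predicted per-pixel kernel to a local neighborhood, can be written compactly with `unfold`; the kernel-prediction network itself is omitted and the shapes are illustrative, so this is a sketch of the filtering step only.

```python
import torch
import torch.nn.functional as F

def apply_dynamic_kernels(image, kernels, k=3):
    """Filter each pixel with its own predicted k x k kernel.

    image:   (B, C, H, W) noisy input
    kernels: (B, k*k, H, W) per-pixel kernels (e.g. softmax-normalized over k*k)
    """
    B, C, H, W = image.shape
    patches = F.unfold(image, kernel_size=k, padding=k // 2)      # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    out = (patches * kernels.unsqueeze(1)).sum(dim=2)             # weighted sum over each window
    return out

# One filtering iteration with random kernels (in practice predicted by the network).
img = torch.rand(1, 3, 32, 32)
kern = torch.softmax(torch.randn(1, 9, 32, 32), dim=1)
denoised = apply_dynamic_kernels(img, kern)
```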

Authors:Yang Li, Quan Yuan, Guiyang Luo, Xiaoyuan Fu, Rui Pan, Yujia Yang, Congzhang Shao, Yuewen Liu, Jinglin Li
Title: Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception
Abstract:
Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point-level tokens as intermediate representations for collaborative perception. However, point-cloud data are inherently unordered, massive, and position-sensitive, making it challenging to produce compact and aligned point-level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. A semantic-aware token reordering module generates adaptive 1D reorderings by leveraging scene-level and token-level semantic information. A frequency-enhanced state space model captures long-range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor-to-ego alignment module applies a closed-loop process, combining global agent-level correction with local token-level refinement to mitigate localization noise. Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Code will be available at https://github.com/CheeryLeeyy/CoPLOT.
中文:CoPLOT框架通过语义感知重排序、频率增强序列建模和多智能体对齐技术,利用点级优化标记保留三维结构细节,以更低的通信和计算成本实现了协同感知的性能突破。
English: The CoPLOT framework introduces point-level optimized tokens to enhance collaborative perception by preserving 3D structural details through semantic-aware reordering, frequency-enhanced sequence modeling, and multi-agent alignment, achieving superior performance with reduced overhead.

Authors:Jiajun Sun, Zhen Yu, Siyuan Yan, Jason J. Ong, Zongyuan Ge, Lei Zhang
Title: Controllable Skin Synthesis via Lesion-Focused Vector Autoregression Model
Abstract:
Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion's location and type. To address these limitations, we present LF-VAR, a model leveraging quantified lesion measurement scores and lesion type labels to guide the clinically relevant and controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. Then, a Visual AutoRegressive (VAR) Transformer trained on tokenized representations facilitates image synthesis. Lesion measurements from the lesion region and lesion type labels are integrated as conditional embeddings to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) among seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model's effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at https://github.com/echosun1996/LF-VAR.
Chinese: LF-VAR模型通过结合病变测量评分、类型标签和语言提示,提出了一种生成高保真、临床相关皮肤图像的新方法,其FID评分比现有最佳方法提高了6.3%。
English: The LF-VAR model introduces a novel approach for generating high-fidelity, clinically relevant skin images by integrating lesion measurement scores and type labels with language prompts, achieving a 6.3% improvement in FID score over previous methods.

Authors:Toghrul Karimov, Hassan Imani, Allan Kazakov
Title: Quantization Robustness to Input Degradations for Object Detection
Abstract:
Post-training quantization (PTQ) is crucial for deploying efficient object detection models, like YOLO, on resource-constrained devices. However, the impact of reduced precision on model robustness to real-world input degradations such as noise, blur, and compression artifacts is a significant concern. This paper presents a comprehensive empirical study evaluating the robustness of YOLO models (nano to extra-large scales) across multiple precision formats: FP32, FP16 (TensorRT), Dynamic UINT8 (ONNX), and Static INT8 (TensorRT). We introduce and evaluate a degradation-aware calibration strategy for Static INT8 PTQ, where the TensorRT calibration process is exposed to a mix of clean and synthetically degraded images. Models were benchmarked on the COCO dataset under seven distinct degradation conditions (including various types and levels of noise, blur, low contrast, and JPEG compression) and a mixed-degradation scenario. Results indicate that while Static INT8 TensorRT engines offer substantial speedups (~1.5-3.3x) with a moderate accuracy drop (~3-7% mAP50-95) on clean data, the proposed degradation-aware calibration did not yield consistent, broad improvements in robustness over standard clean-data calibration across most models and degradations. A notable exception was observed for larger model scales under specific noise conditions, suggesting model capacity may influence the efficacy of this calibration approach. These findings highlight the challenges in enhancing PTQ robustness and provide insights for deploying quantized detectors in uncontrolled environments. All code and evaluation tables are available at https://github.com/AllanK24/QRID.
中文: 本研究评估了YOLO模型在不同精度格式和退化条件下的鲁棒性,发现虽然静态INT8量化能显著提升速度,但提出的退化感知校准方法在多数情况下未能持续改善鲁棒性,仅在大模型的特定噪声条件下表现例外。
English: This study evaluates the robustness of YOLO models under various precision formats and degradation conditions, finding that while static INT8 quantization provides significant speed improvements, a proposed degradation-aware calibration method generally fails to enhance robustness consistently across most scenarios, except for specific noise conditions in larger models.
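The degradation-aware calibration idea, exposing the INT8 calibrator to a mix of clean and synthetically degraded images, can be sketched as below. The particular degradation set and the 50/50 mix ratio are illustrative assumptions, and the TensorRT-specific calibrator plumbing is omitted.

```python
import random
import numpy as np
import cv2

def degrade(img: np.ndarray) -> np.ndarray:
    """Apply one randomly chosen synthetic degradation (illustrative set)."""
    choice = random.choice(["gauss_noise", "blur", "jpeg", "low_contrast"])
    if choice == "gauss_noise":
        noise = np.random.normal(0, 15, img.shape)
        return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    if choice == "blur":
        return cv2.GaussianBlur(img, (7, 7), 0)
    if choice == "jpeg":
        _, enc = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, 30])
        return cv2.imdecode(enc, cv2.IMREAD_COLOR)
    # Low contrast: shrink the intensity range around mid-gray.
    return np.clip((img.astype(np.float32) - 127.5) * 0.5 + 127.5, 0, 255).astype(np.uint8)

def build_calibration_set(images, degraded_fraction=0.5):
    """Mix clean and degraded copies for degradation-aware PTQ calibration."""
    return [degrade(img) if random.random() < degraded_fraction else img
            for img in images]
```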

Authors:Yuhang Zhao, Zixing Wang
Title: FlowDet: Overcoming Perspective and Scale Challenges in Real-Time End-to-End Traffic Detection
Abstract:
End-to-end object detectors offer a promising NMS-free paradigm for real-time applications, yet their high computational cost remains a significant barrier, particularly for complex scenarios like intersection traffic monitoring. To address this challenge, we propose FlowDet, a high-speed detector featuring a decoupled encoder optimization strategy applied to the DETR architecture. Specifically, FlowDet employs a novel Geometric Deformable Unit (GDU) for traffic-aware geometric modeling and a Scale-Aware Attention (SAA) module to maintain high representational power across extreme scale variations. To rigorously evaluate the model's performance in environments with severe occlusion and high object density, we collected the Intersection-Flow-5k dataset, a new challenging scene for this task. Evaluated on Intersection-Flow-5k, FlowDet establishes a new state-of-the-art. Compared to the strong RT-DETR baseline, it improves AP(test) by 1.5% and AP50(test) by 1.6%, while simultaneously reducing GFLOPs by 63.2% and increasing inference speed by 16.2%. Our work demonstrates a new path towards building highly efficient and accurate detectors for demanding, real-world perception systems. The Intersection-Flow-5k dataset is available at https://github.com/AstronZh/Intersection-Flow-5K.
中文: FlowDet采用解耦编码器优化策略,结合创新的几何变形单元和尺度感知模块,在Intersection-Flow-5k数据集上实现最优性能,大幅降低计算成本的同时提升检测精度与速度。
English: FlowDet introduces a decoupled encoder optimization strategy with novel geometric and scale-aware modules to achieve state-of-the-art performance on the Intersection-Flow-5k dataset, significantly reducing computational costs while improving accuracy and speed.

Authors:Yu-Wei Zhang, Tongju Han, Lipeng Gao, Mingqiang Wei, Hui Liu, Changbao Li, Caiming Zhang
Title: MonoRelief V2: Leveraging Real Data for High-Fidelity Monocular Relief Recovery
Abstract:
This paper presents MonoRelief V2, an end-to-end model designed for directly recovering 2.5D reliefs from single images under complex material and illumination variations. In contrast to its predecessor, MonoRelief V1 [1], which was solely trained on synthetic data, MonoRelief V2 incorporates real data to achieve improved robustness, accuracy and efficiency. To overcome the challenge of acquiring a large-scale real-world dataset, we generate approximately 15,000 pseudo-real images using a text-to-image generative model, and derive corresponding depth pseudo-labels through fusion of depth and normal predictions. Furthermore, we construct a small-scale real-world dataset (800 samples) via multi-view reconstruction and detail refinement. MonoRelief V2 is then progressively trained on the pseudo-real and real-world datasets. Comprehensive experiments demonstrate its state-of-the-art performance both in depth and normal predictions, highlighting its strong potential for a range of downstream applications. Code is at: https://github.com/glp1001/MonoreliefV2.
中文摘要:MonoRelief V2 是一种改进的端到端模型,通过结合伪真实和真实世界数据,从单张图像中恢复 2.5D 浮雕,在深度和法线预测方面展现出优于前代模型的鲁棒性和准确性。
English Summary: MonoRelief V2 is an enhanced end-to-end model that recovers 2.5D reliefs from single images with greater robustness and accuracy by incorporating both pseudo-real and real-world data, outperforming its predecessor in depth and normal predictions.

Authors:Eduardo Davalos, Yike Zhang, Namrata Srivastava, Yashvitha Thatigotla, Jorge A. Salas, Sara McFadden, Sun-Joo Cho, Amanda Goodwin, Ashwin TS, Gautam Biswas
Title: WEBEYETRACK: Scalable Eye-Tracking for the Browser via On-Device Few-Shot Personalization
Abstract:
With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce WebEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k < 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at https://github.com/RedForestAi/WebEyeTrack.
中文:WebEyeTrack推出了一种轻量级的浏览器内视线追踪框架,通过最少校准即可实现顶尖精度和实时性能,有效解决了现有AI模型与网络摄像头方法中的不足。
English: WebEyeTrack introduces a lightweight in-browser gaze estimation framework that achieves state-of-the-art accuracy with minimal calibration and real-time performance, addressing gaps in current AI models and webcam-based methods.
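The on-device few-shot personalization step can be approximated by fitting a tiny correction model on the user's calibration samples, mapping the base model's gaze predictions to ground-truth screen positions. This is a hedged sketch using ridge-regularized affine regression; the actual WebEyeTrack adaptation mechanism may differ.

```python
import numpy as np

def personalize(base_preds_cm, targets_cm, reg=1e-2):
    """Fit an affine correction from base gaze predictions to the user's
    calibration targets (k samples, 2D coordinates in cm)."""
    X = np.hstack([base_preds_cm, np.ones((len(base_preds_cm), 1))])  # append bias
    # Ridge-regularized least squares: W = (X^T X + reg*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ targets_cm)
    return W  # shape (3, 2)

def apply_correction(base_pred_cm, W):
    """Correct a new base prediction with the personalized affine map."""
    return np.append(base_pred_cm, 1.0) @ W

# Usage with, e.g., nine calibration samples:
# W = personalize(preds[:9], clicks[:9]); corrected = apply_correction(new_pred, W)
```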

Authors:Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, Xiongfei Yao, Shuaiwei Jiao
Title: CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
Abstract:
While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability across multiple videos remains critically underexplored. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We extensively evaluate 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot and chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared to the 91% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.
中文: CVBench是首个全面评估多模态大语言模型跨视频关系推理能力的基准,揭示了现有模型与人类表现间的显著差距及架构瓶颈。
English: CVBench is the first comprehensive benchmark designed to rigorously evaluate cross-video relational reasoning in multimodal large language models, revealing significant performance gaps and architectural bottlenecks compared to human capabilities.

Authors:Zhixin Lin, Jungang Li, Shidong Pan, Yibo Shi, Yue Yao, Dongliang Xu
Title: Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
Abstract:
Smartphones bring significant convenience to users but also enable devices to extensively record various types of personal information. Existing smartphone agents powered by Multimodal Large Language Models (MLLMs) have achieved remarkable performance in automating different tasks. However, as a cost, these agents are granted substantial access to users' sensitive personal information during operation. To gain a thorough understanding of the privacy awareness of these agents, we present what is, to the best of our knowledge, the first large-scale benchmark, encompassing 7,138 scenarios. In addition, for the privacy context in each scenario, we annotate its type (e.g., Account Credentials), sensitivity level, and location. We then carefully benchmark seven available mainstream smartphone agents. Our results demonstrate that almost all benchmarked agents show unsatisfactory privacy awareness (RA), with performance remaining below 60% even with explicit hints. Overall, closed-source agents show better privacy awareness than open-source ones, with Gemini 2.0-flash performing best at an RA of 67%. We also find that the agents' privacy detection capability is highly related to the scenario's sensitivity level, i.e., scenarios with higher sensitivity levels are typically more identifiable. We hope these findings prompt the research community to rethink the unbalanced utility-privacy tradeoff of smartphone agents. Our code and benchmark are available at https://zhixin-l.github.io/SAPA-Bench.

Authors:Xinlong Zhao, Qixiang Pang, Shan Du
Title: JVLGS: Joint Vision-Language Gas Leak Segmentation
Abstract:
Gas leaks pose serious threats to human health and contribute significantly to atmospheric pollution, drawing increasing public concern. However, the lack of effective detection methods hampers timely and accurate identification of gas leaks. While some vision-based techniques leverage infrared videos for leak detection, the blurry and non-rigid nature of gas clouds often limits their effectiveness. To address these challenges, we propose a novel framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of visual and textual modalities to enhance gas leak representation and segmentation. Recognizing that gas leaks are sporadic and many video frames may contain no leak at all, our method incorporates a post-processing step to reduce false positives caused by noise and non-target objects, an issue that affects many existing approaches. Extensive experiments conducted across diverse scenarios show that JVLGS significantly outperforms state-of-the-art gas leak segmentation methods. We evaluate our model under both supervised and few-shot learning settings, and it consistently achieves strong performance in both, whereas competing methods tend to perform well in only one setting or poorly in both. Code available at: https://github.com/GeekEagle/JVLGS
中文:提出的JVLGS框架融合视觉与文本数据以提升气体泄漏分割效果,通过后处理减少误报,在监督学习和少样本学习场景下均显著优于现有方法。
English: The proposed JVLGS framework integrates visual and textual data to improve gas leak segmentation, incorporating post-processing to reduce false positives and demonstrating superior performance in both supervised and few-shot learning settings compared to existing methods.

Authors:Xueyang Li, Mingze Jiang, Gelei Xu, Jun Xia, Mengzhao Jia, Danny Chen, Yiyu Shi
Title: AT-CXR: Uncertainty-Aware Agentic Triage for Chest X-rays
Abstract:
Agentic AI is advancing rapidly, yet truly autonomous medical-imaging triage, where a system decides when to stop, escalate, or defer under real constraints, remains relatively underexplored. To address this gap, we introduce AT-CXR, an uncertainty-aware agent for chest X-rays. The system estimates per-case confidence and distributional fit, then follows a stepwise policy to issue an automated decision or abstain with a suggested label for human intervention. We evaluate two router designs that share the same inputs and actions: a deterministic rule-based router and an LLM-decided router. Across five-fold evaluation on a balanced subset of the NIH ChestX-ray14 dataset, both variants outperform strong zero-shot vision-language models and state-of-the-art supervised classifiers, achieving higher full-coverage accuracy and superior selective-prediction performance, evidenced by a lower area under the risk-coverage curve (AURC) and a lower error rate at high coverage, while operating with latency low enough to meet practical clinical constraints. The two routers provide complementary operating points, enabling deployments to prioritize maximal throughput or maximal accuracy. Our code is available at https://github.com/XLIAaron/uncertainty-aware-cxr-agent.
中文: 本文提出AT-CXR这一面向胸部X光分诊的不确定性感知AI代理,通过置信度估计和分级策略实现自动化决策或人工介入转交,在准确性和效率上均优于现有模型,并提供两种互补的路由器设计以适应不同临床需求。
English: This paper introduces AT-CXR, an uncertainty-aware AI agent for chest X-ray triage that uses confidence estimation and stepwise policies to automate decisions or defer to humans, outperforming existing models in accuracy and efficiency while offering complementary router designs for clinical deployment.
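A deterministic rule-based router of the kind described, acting on per-case confidence and a distributional-fit score, might look like the sketch below; the thresholds and score names are placeholders, not the paper's calibrated values.

```python
from dataclasses import dataclass

@dataclass
class TriageDecision:
    action: str   # "automate" or "abstain"
    label: str    # predicted label (suggested label when abstaining)

def rule_based_router(confidence: float, dist_fit: float, label: str,
                      conf_thr: float = 0.90, fit_thr: float = 0.50) -> TriageDecision:
    """Issue an automated decision only when the case is both high-confidence
    and in-distribution; otherwise abstain and defer to a human reader."""
    if confidence >= conf_thr and dist_fit >= fit_thr:
        return TriageDecision(action="automate", label=label)
    return TriageDecision(action="abstain", label=label)
```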

Authors:Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, Xiaoqiang Liu, Pengfei Wan
Title: MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation
Abstract:
Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging to existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64$\times$ reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.

Authors:Chen Chu, Cyrus Shahabi
Title: Geo2Vec: Shape- and Distance-Aware Neural Representation of Geospatial Entities
Abstract:
Spatial representation learning is essential for GeoAI applications such as urban analytics, enabling the encoding of shapes, locations, and spatial relationships (topological and distance-based) of geo-entities like points, polylines, and polygons. Existing methods either target a single geo-entity type or, like Poly2Vec, decompose entities into simpler components to enable Fourier transformation, introducing high computational cost. Moreover, since the transformed space lacks geometric alignment, these methods rely on uniform, non-adaptive sampling, which blurs fine-grained features like edges and boundaries. To address these limitations, we introduce Geo2Vec, a novel method inspired by signed distance fields (SDF) that operates directly in the original space. Geo2Vec adaptively samples points and encodes their signed distances (positive outside, negative inside), capturing geometry without decomposition. A neural network trained to approximate the SDF produces compact, geometry-aware, and unified representations for all geo-entity types. Additionally, we propose a rotation-invariant positional encoding to model high-frequency spatial variations and construct a structured and robust embedding space for downstream GeoAI models. Empirical results show that Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Code and Data can be found at: https://github.com/chuchen2017/GeoNeuralRepresentation.
中文摘要:Geo2Vec提出了一种基于符号距离场的新颖空间表示方法,无需分解即可直接编码几何特征,在GeoAI应用中能更有效地捕捉形状、空间关系并提升性能。
English Summary: Geo2Vec introduces a novel spatial representation method using signed distance fields to directly encode geometry without decomposition, achieving superior performance in capturing shapes, spatial relationships, and efficiency in GeoAI applications.
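The signed-distance encoding at the heart of Geo2Vec can be illustrated with adaptive point sampling around a polygon: distances are positive outside and negative inside, and an MLP is later trained to regress them. The sampling scheme and the shapely-based geometry below are assumptions of this sketch, not the released implementation.

```python
import numpy as np
from shapely.geometry import Point, Polygon

def sample_signed_distances(poly: Polygon, n_near=512, n_far=128, band=0.1):
    """Sample points and their signed distances to a polygon boundary
    (negative inside, positive outside), denser near the boundary."""
    minx, miny, maxx, maxy = poly.bounds
    pts = []
    # Near-boundary samples: jittered points along the exterior ring.
    boundary = poly.exterior
    for t in np.linspace(0, boundary.length, n_near, endpoint=False):
        p = boundary.interpolate(t)
        pts.append((p.x + np.random.uniform(-band, band),
                    p.y + np.random.uniform(-band, band)))
    # Coarse samples over the bounding box for global context.
    for _ in range(n_far):
        pts.append((np.random.uniform(minx, maxx), np.random.uniform(miny, maxy)))
    xy = np.array(pts)
    sdf = np.array([
        boundary.distance(Point(x, y)) * (-1.0 if poly.contains(Point(x, y)) else 1.0)
        for x, y in xy
    ])
    return xy, sdf  # training pairs for an SDF-regressing MLP
```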

Authors:Abu Sufian, Anirudha Ghosh, Debaditya Barman, Marco Leo, Cosimo Distante
Title: DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities across various downstream tasks, including biometric face recognition (FR) with description. However, demographic biases remain a critical concern in FR, as these foundation models often fail to perform equitably across diverse demographic groups, considering ethnicity/race, gender, and age. Therefore, through our work DemoBias, we conduct an empirical evaluation to investigate the extent of demographic biases in LVLMs for biometric FR with textual token generation tasks. We fine-tuned and evaluated three widely used pre-trained LVLMs: LLaVA, BLIP-2, and PaliGemma, on our own generated, demographically balanced dataset. We utilize several evaluation metrics, such as group-specific BERTScores and the Fairness Discrepancy Rate, to quantify and trace the performance disparities. The experimental results deliver compelling insights into the fairness and reliability of LVLMs across diverse demographic groups. Our empirical study uncovered demographic biases in LVLMs, with PaliGemma and LLaVA exhibiting higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 demonstrated comparatively consistent performance. Repository: https://github.com/Sufianlab/DemoBias.
中文: 大型视觉语言模型在人脸识别任务中存在人口统计偏差,其中PaliGemma和LLaVA对西班牙裔/拉丁裔、高加索人和南亚群体表现出更高差异,而BLIP-2在不同人群中的表现相对一致。
English: Large Vision Language Models exhibit demographic biases in face recognition tasks, with PaliGemma and LLaVA showing higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, while BLIP-2 performs more consistently across diverse populations.
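The group-wise disparity tracing described above reduces to a few lines once per-group scores (e.g., group-specific BERTScores) are available; the max-min gap used below is a common summary and an assumption here, not necessarily the paper's exact Fairness Discrepancy Rate.

```python
def group_disparity(group_scores):
    """Summarize performance disparities across demographic groups,
    given a dict of per-group scores such as group-specific BERTScores."""
    best = max(group_scores, key=group_scores.get)
    worst = min(group_scores, key=group_scores.get)
    return {"best_group": best, "worst_group": worst,
            "gap": group_scores[best] - group_scores[worst]}

# Hypothetical per-group scores, for illustration only:
scores = {"Caucasian": 0.87, "Hispanic/Latino": 0.81, "South Asian": 0.79, "East Asian": 0.85}
print(group_disparity(scores))
```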

Authors:Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
Title: VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Abstract:
3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.

Authors:Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
Title: Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Abstract:
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

Authors:Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, Guangcong Wang
Title: Style4D-Bench: A Benchmark Suite for 4D Stylization
Abstract:
We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, 2) a strong baseline that makes an initial attempt at 4D stylization, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Project page: https://becky-catherine.github.io/Style4D . Code: https://github.com/Becky-catherine/Style4D-Bench .
中文:Style4D-Bench是首个专门针对4D风格化的基准套件,包含评估协议、基于4D高斯溅射的Style4D基线模型和精选动态场景,旨在推动该领域研究的标准化发展。
English: Style4D-Bench is the first comprehensive benchmark for 4D stylization, featuring evaluation protocols, a strong baseline model called Style4D, and curated dynamic scenes to advance research in this field.

Authors:Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Title: MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Abstract:
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
中文摘要:MemoryVLA提出了一种认知-记忆-行动框架,通过整合工作记忆与长期记忆机制来增强机器人操作能力,在仿真和现实任务中均超越现有模型,取得了卓越性能。
English Summary: MemoryVLA introduces a cognitive-memory-action framework that enhances robotic manipulation by integrating working memory and long-term memory mechanisms, achieving superior performance in both simulated and real-world tasks over existing models.
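The retrieve-fuse-consolidate loop of the memory bank described above can be sketched with cosine-similarity retrieval and redundancy merging; the dimensions, thresholds, and fusion rule are illustrative assumptions rather than the released architecture.

```python
import numpy as np

class MemoryBank:
    """Toy perceptual-cognitive memory: store token vectors, retrieve the
    most relevant entries, and merge near-duplicates on update."""

    def __init__(self, merge_thr=0.95):
        self.entries = []          # list of 1D feature vectors
        self.merge_thr = merge_thr

    @staticmethod
    def _cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def retrieve(self, query, k=4):
        """Return the k stored entries most similar to the current tokens."""
        if not self.entries:
            return []
        sims = [self._cos(query, e) for e in self.entries]
        order = np.argsort(sims)[::-1][:k]
        return [self.entries[i] for i in order]

    def update(self, new_entry):
        """Consolidate: merge into an existing entry if redundant, else append."""
        for i, e in enumerate(self.entries):
            if self._cos(new_entry, e) > self.merge_thr:
                self.entries[i] = 0.5 * (e + new_entry)
                return
        self.entries.append(new_entry)

def fuse(current_tokens, retrieved, alpha=0.7):
    """Adaptive fusion placeholder: weighted average of current and memory."""
    if not retrieved:
        return current_tokens
    return alpha * current_tokens + (1 - alpha) * np.mean(retrieved, axis=0)
```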

Authors:Kaveh Safavigerdini, Ramakrishna Surya, Jaired Collins, Prasad Calyam, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan
Title: Automated Feature Tracking for Real-Time Kinematic Analysis and Shape Estimation of Carbon Nanotube Growth
Abstract:
Carbon nanotubes (CNTs) are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using scanning electron microscopy (SEM) imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. We present Visual Feature Tracking (VFTrack), an in-situ, real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences. VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth. A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with the LightGlue matcher as an optimal combination (F1-score of 0.78, α-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. This work advances automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis.
中文摘要:VFTrack作为实时原位粒子追踪框架,能自动检测并跟踪扫描电镜图像中的碳纳米管粒子运动,通过运动矢量分解实现生长动力学分析与形态重建,推动纳米材料表征自动化发展。
English Summary: VFTrack is an automated in-situ real-time framework that tracks individual carbon nanotube particles in SEM images, enabling kinematic analysis and growth rate calculations to bridge experimental observation with material synthesis optimization.
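A generic detect-and-match tracking loop of the kind VFTrack builds on can be sketched with OpenCV's ORB features and a brute-force matcher as stand-ins for the ALIKED/LightGlue combination the paper identifies; the axial/lateral decomposition below assumes growth along the image y-axis, which is an assumption of this sketch.

```python
import cv2
import numpy as np

def frame_to_frame_motion(img0, img1, max_matches=200):
    """Detect features in two consecutive SEM frames, match them, and return
    per-feature displacement vectors (dx, dy)."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp0, des0 = orb.detectAndCompute(img0, None)
    kp1, des1 = orb.detectAndCompute(img1, None)
    if des0 is None or des1 is None:
        return np.empty((0, 2))
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des0, des1), key=lambda m: m.distance)[:max_matches]
    return np.array([np.subtract(kp1[m.trainIdx].pt, kp0[m.queryIdx].pt) for m in matches])

def decompose_motion(disp):
    """Split displacements into axial growth (y) and lateral drift (x),
    assuming the micropillar grows along the image y-axis."""
    if len(disp) == 0:
        return 0.0, 0.0
    return float(np.median(disp[:, 1])), float(np.median(disp[:, 0]))
```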

Authors:Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao
Title: OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
Abstract:
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/

Authors:Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang
Title: All-in-One Slider for Attribute Manipulation in Diffusion Models
Abstract:
Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.
中文摘要:All-in-One Slider 是一个轻量级模块,通过将文本嵌入空间分解为稀疏的语义属性方向,实现了对多种图像属性的精细控制,无需针对新属性重复训练即可支持零样本操作和真实图像编辑。
English Summary: The All-in-One Slider is a lightweight module that enables fine-grained control over multiple image attributes through sparse decomposition of text embeddings, supporting zero-shot manipulation and real-image editing without redundant training for new attributes.
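Once the attribute directions are learned, continuous control reduces to adding a scaled direction to the prompt's text embedding. The sketch below assumes a dictionary of learned unit-norm directions and a pipeline that accepts prompt embeddings; it is an illustration of the idea, not the released code.

```python
import torch

def apply_slider(text_emb: torch.Tensor, directions: dict, attribute: str,
                 strength: float) -> torch.Tensor:
    """Shift a prompt embedding along a learned attribute direction.

    text_emb:   (seq_len, dim) prompt embedding from the text encoder
    directions: {"age": (dim,), "smile": (dim,), ...} learned unit vectors
    strength:   slider value, e.g. in [-1, 1]
    """
    d = directions[attribute]
    return text_emb + strength * d.to(text_emb.dtype)

def compose_sliders(text_emb, directions, settings):
    """Zero-shot composition: apply several scaled directions in sequence,
    e.g. settings = {"age": 0.6, "smile": -0.3}."""
    out = text_emb.clone()
    for attr, s in settings.items():
        out = apply_slider(out, directions, attr, s)
    return out
```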

Authors:Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez, Jan Held, Carlos Hinojosa, Victor Joos, Arnaud Leduc, Floriane Magera, Karen Sanchez, Vladimir Somers, Artur Xarles, Antonio Agudo, Alexandre Alahi, Olivier Barnich, Albert Clapés, Christophe De Vleeschouwer, Sergio Escalera, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck, Tomoki Abe, Saad Alotaibi, Faisal Altawijri, Steven Araujo, Xiang Bai, Xiaoyang Bi, Jiawang Cao, Vanyi Chao, Kamil Czarnogórski, Fabian Deuser, Mingyang Du, Tianrui Feng, Patrick Frenzel, Mirco Fuchs, Jorge García, Konrad Habel, Takaya Hashiguchi, Sadao Hirose, Xinting Hu, Yewon Hwang, Ririko Inoue, Riku Itsuji, Kazuto Iwai, Hongwei Ji, Yangguang Ji, Licheng Jiao, Yuto Kageyama, Yuta Kamikawa, Yuuki Kanasugi, Hyungjung Kim, Jinwook Kim, Takuya Kurihara, Bozheng Li, Lingling Li, Xian Li, Youxing Lian, Dingkang Liang, Hongkai Lin, Jiadong Lin, Jian Liu, Liang Liu, Shuaikun Liu, Zhaohong Liu, Yi Lu, Federico Méndez, Huadong Ma, Wenping Ma, Jacek Maksymiuk, Henry Mantilla, Ismail Mathkour, Daniel Matthes, Ayaha Motomochi, Amrulloh Robbani Muhammad, Haruto Nakayama, Joohyung Oh, Yin May Oo, Marcelo Ortega, Norbert Oswald, Rintaro Otsubo, Fabian Perez, Mengshi Qi, Cristian Rey, Abel Reyes-Angulo, Oliver Rose, Hoover Rueda-Chacón, Hideo Saito, Jose Sarmiento, Kanta Sawafuji, Atom Scott, Xi Shen, Pragyan Shrestha, Jae-Young Sim, Long Sun, Yuyang Sun, Tomohiro Suzuki, Licheng Tang, Masato Tonouchi, Ikuma Uchida, Henry O. Velesaca, Tiancheng Wang, Rio Watanabe, Jay Wu, Yongliang Wu, Shunzo Yamagishi, Di Yang, Xu Yang, Yuxin Yang, Hao Ye, Xinyu Ye, Calvin Yeung, Xuanlong Yu, Chao Zhang, Dingyuan Zhang, Kexing Zhang, Zhe Zhao, Xin Zhou, Wenbo Zhu, Julian Ziegler
Title: SoccerNet 2025 Challenges Results
Abstract:
The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year's challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.
Chinese: SoccerNet 2025挑战赛推出四项足球视频分析的计算机视觉任务——团队控球动作识别、单目深度估计、多视角犯规识别和比赛状态重建,通过提供数据集和基准测试推动体育相关人工智能研究发展。
English: The SoccerNet 2025 Challenges introduce four computer vision tasks for football video analysis—team ball action spotting, monocular depth estimation, multi-view foul recognition, and game state reconstruction—providing datasets and benchmarks to advance sports-related AI research.

Authors:Rafael Sterzinger, Tingyu Lin, Robert Sablatnig
Title: Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents
Abstract:
A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.
中文:本研究采用轻量级UNet++模型和拓扑感知损失函数,显著提升了历史文档文本行分割的准确性和数据效率,仅需每份手稿的三页标注即可达到最先进的性能。
English: This study introduces a lightweight UNet++ model with a topology-aware loss function that significantly enhances text line segmentation accuracy and data efficiency for historical documents, achieving state-of-the-art results using only three annotated pages per manuscript.

Authors:Florian Hahlbohm, Linus Franke, Leon Overkämping, Paula Wespe, Susana Castillo, Martin Eisemann, Marcus Magnor
Title: A Bag of Tricks for Efficient Implicit Neural Point Clouds
Abstract:
Implicit Neural Point Cloud (INPC) is a recent hybrid representation that combines the expressiveness of neural fields with the efficiency of point-based rendering, achieving state-of-the-art image quality in novel view synthesis. However, as with other high-quality approaches that query neural networks during rendering, the practical usability of INPC is limited by comparatively slow rendering. In this work, we present a collection of optimizations that significantly improve both the training and inference performance of INPC without sacrificing visual fidelity. The most significant modifications are an improved rasterizer implementation, more effective sampling techniques, and the incorporation of pre-training for the convolutional neural network used for hole-filling. Furthermore, we demonstrate that points can be modeled as small Gaussians during inference to further improve quality in extrapolated, e.g., close-up views of the scene. We design our implementations to be broadly applicable beyond INPC and systematically evaluate each modification in a series of experiments. Our optimized INPC pipeline achieves up to 25% faster training, 2x faster rendering, and 20% reduced VRAM usage paired with slight image quality improvements.

Authors:Blaž Rolih, Matic Fučka, Danijel Skočaj
Title: No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes
Abstract:
Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet
中文摘要:SuperSimpleNet是一种高效且适应性强的表面缺陷检测模型,能统一四种监督场景并实现卓越性能与快速推理,有效弥合工业应用与学术研究之间的差距。
English Summary: SuperSimpleNet is a highly efficient and adaptable model that unifies four supervision scenarios for surface defect detection, achieving superior performance and fast inference times to bridge industrial and academic needs.

Authors:Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, Winston H. Hsu
Title: MovieCORE: COgnitive REasoning in Movies
Abstract:
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

Authors:Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Title: ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval
Abstract:
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
Chinese: ProPy通过设计提示金字塔结构和祖先后代交互机制,专门针对部分相关视频检索任务优化CLIP模型,在三个公开数据集上实现了最先进的性能表现。
English: ProPy introduces a Prompt Pyramid structure and an Ancestor-Descendant Interaction Mechanism to adapt CLIP for Partially Relevant Video Retrieval, achieving state-of-the-art performance on three public datasets.

Authors:Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He
Title: USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
Abstract:
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
中文: 本文提出USO模型,通过构建大规模三元组数据集、引入解耦学习方案及风格奖励学习范式,将风格驱动与主体驱动生成统一于单一框架,在风格相似性和主体保真度上均达到开源模型的最优性能。
English: The paper introduces USO, a unified model that integrates style-driven and subject-driven generation by disentangling and recomposing content and style through a novel dataset, learning scheme, and benchmark, achieving state-of-the-art performance in both style similarity and subject consistency.

Authors:Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, Duy-Nam Ly, Viet-Tham Huynh, Khanh-Duy Le, Mai-Khiem Tran, Trung-Nghia Le
Title: Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025
Abstract:
The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.
中文摘要:ACM Multimedia 2025的EVENTA挑战赛首次构建了事件级多模态理解的大规模基准,通过结合上下文与语义信息设计了图像检索和描述双赛道,吸引了六国45支团队参与,为叙事驱动型多媒体AI奠定了基础。
English Summary: The EVENTA Grand Challenge at ACM Multimedia 2025 establishes the first large-scale benchmark for event-level multimodal understanding, addressing gaps in contextual and semantic analysis through two tracks—Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval—with participation from 45 teams across six countries.

Authors:Zhehao Li, Chong Wang, Yi Chen, Yinghao Lu, Jiangbo Qian, Jiong Wang, Jiafei Wu
Title: DQEN: Dual Query Enhancement Network for DETR-based HOI Detection
Abstract:
Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model's effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model's ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.
中文摘要:双重查询增强网络(DQEN)通过对象感知特征增强对象查询,并利用CLIP的语义特征强化交互查询,从而在人-物交互检测任务中取得优异性能。
English Summary: The Dual Query Enhancement Network (DQEN) improves Human-Object Interaction detection by enhancing object queries with object-aware features and interaction queries with semantic features from CLIP, achieving competitive results on standard datasets.

Authors:Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun
Title: C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning
Abstract:
Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in CL to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose Continual Flatness (C-Flat), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Moreover, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
中文: 本文提出C-Flat方法,通过在持续学习中促进更平坦的损失曲面来提升各种场景下的性能,其改进版C-Flat++在保持效果的同时显著降低了更新成本。
English: The paper introduces C-Flat, a plug-and-play method that promotes flatter loss landscapes in continual learning to enhance performance across various settings, with an improved version, C-Flat++, reducing update costs while maintaining effectiveness.
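The flat-minima idea C-Flat builds on can be illustrated with a generic sharpness-aware minimization step: perturb the weights toward the locally worst direction, then update with the gradient measured there. This is a standard SAM-style sketch, not the C-Flat objective or the released code; loss_fn is a placeholder that returns a scalar loss for a batch.

```python
import torch

def sharpness_aware_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One SAM-style update that seeks flatter minima (generic sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Gradient at the current weights.
    base_opt.zero_grad()
    loss_fn(model, batch).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    scale = (rho / grad_norm).item()

    # 2) Ascend to the worst-case point in a rho-ball around the weights.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=scale)

    # 3) Gradient at the perturbed weights drives the actual update.
    base_opt.zero_grad()
    loss_fn(model, batch).backward()

    # 4) Restore the original weights, then step with the perturbed gradient.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=scale)
    base_opt.step()
```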

Authors:Zizheng Guo, Bochao Zou, Yinuo Jia, Xiangyu Li, Huimin Ma
Title: Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression
Abstract:
Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual's genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model's capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.
Chinese: 本文提出了一种先验引导的视频级回归方法用于微表情分析,采用可扩展区间选择策略和协同优化框架,在多个基准数据集上实现了最先进的性能。
English: This paper introduces a prior-guided video-level regression method for micro-expression analysis, featuring a scalable interval selection strategy and a synergistic optimization framework that achieves state-of-the-art performance on benchmark datasets.

Authors:Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
Title: Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose \textit{Hidden Tail}, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that \textit{Hidden Tail} outperforms existing attacks, increasing output length by up to 19.2$\times$ and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.
中文: "隐尾"攻击通过生成对抗性图像诱导视觉语言模型重复生成特殊标记,在保持语义连贯性的同时将输出长度提升19.2倍,实现了高效且隐蔽的资源消耗攻击。
English: The proposed "Hidden Tail" attack stealthily maximizes Vision-Language Model inference costs by generating adversarial images that induce repetitive special tokens while preserving semantic coherence, achieving 19.2× output elongation without compromising stealth.
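
The abstract describes a composite loss with three balanced terms; the hedged sketch below shows one plausible PyTorch form. The weights `w`, the token ids, and the exact term definitions are assumptions, and the paper's dynamic weighting strategy is omitted.

```python
import torch
import torch.nn.functional as F

def hidden_tail_loss(logits, clean_ids, special_id, eos_id, w=(1.0, 1.0, 0.5)):
    """logits: (seq_len, vocab); clean_ids: prefix tokens of the benign response.
    Weights and token ids are placeholders, not the paper's values."""
    k = clean_ids.numel()
    # Term 1: keep the visible part of the answer close to the clean one.
    sem = F.cross_entropy(logits[:k], clean_ids)
    # Term 2: induce a repeated special token over the appended tail.
    tail_target = torch.full((logits.size(0) - k,), special_id,
                             dtype=torch.long, device=logits.device)
    rep = F.cross_entropy(logits[k:], tail_target)
    # Term 3: suppress the end-of-sequence token on the tail positions.
    eos = F.log_softmax(logits[k:], dim=-1)[:, eos_id].exp().mean()
    return w[0] * sem + w[1] * rep + w[2] * eos
```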

Authors:Hassan Abid, Khan Muhammad, Muhammad Haris Khan
Title: Robust and Label-Efficient Deep Waste Detection
Abstract:
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
中文: 本研究通过基准测试开放词汇检测模型并引入集成半监督框架,采用优化提示和空间共识伪标注策略,为废弃物检测建立了强大的AI基线并实现卓越性能。
English: This research establishes strong AI baselines for waste detection by benchmarking open-vocabulary models and introducing an ensemble-based semi-supervised framework that achieves superior performance through optimized prompts and spatial-consensus pseudo-labeling.
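
As a rough illustration of "spatial and consensus-aware weighting", the sketch below re-weights each detector's box score by how strongly the other ensemble members agree spatially. The specific weighting rule is an assumption, not the paper's exact formulation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = np.maximum(a[:2], b[:2])
    x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_pseudo_labels(model_boxes, model_scores):
    """model_boxes: list (per model) of (N_i, 4) arrays; model_scores: (N_i,)."""
    out = []
    for m, (boxes, scores) in enumerate(zip(model_boxes, model_scores)):
        others = [b for k, b in enumerate(model_boxes) if k != m and len(b)]
        for box, s in zip(boxes, scores):
            # Consensus = mean best-IoU with every other model's detections.
            consensus = np.mean([max(iou(box, ob) for ob in obs)
                                 for obs in others]) if others else 0.0
            out.append((box, s * consensus))   # soft (down-weighted) label
    return out
```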

Authors:Luqing Luo, Wenjin Gui, Yunfei Liu, Ziyue Zhang, Yunxi Zhang, Fengxiang Wang, Zonghao Guo, Zizhi Ma, Xinzhu Liu, Hanxiang He, Jinhai Li, Xin Qiu, Wupeng Xie, Yangang Sun
Title: EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding
Abstract:
Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time-frequency structure, which prevents existing general models from direct use. Electromagnetic communication and sensing tasks are diverse, current methods lack cross-task generalization and transfer efficiency, and the scarcity of large, high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signal foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.
中文摘要:EMind基础模型通过构建统一数据集和采用自适应训练策略,有效解决了电磁信号处理中的难题,并在多种下游任务中展现出卓越的泛化能力。
English Summary: The EMind foundation model addresses challenges in electromagnetic signal analysis by introducing a unified dataset and innovative training strategies, achieving strong generalization across various tasks.
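
The "length-adaptive multi-signal packing" idea can be pictured as bin-packing variable-length signals into fixed-budget training sequences to avoid padding waste. The first-fit-decreasing policy below is an assumption; the paper's exact strategy may differ.

```python
def pack_signals(lengths, budget):
    """lengths: list of signal lengths; returns lists of signal indices per pack,
    each pack fitting within `budget` samples (first-fit-decreasing)."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    packs, free = [], []          # free[j] = remaining capacity of packs[j]
    for i in order:
        L = lengths[i]
        for j, cap in enumerate(free):
            if L <= cap:
                packs[j].append(i)
                free[j] -= L
                break
        else:
            packs.append([i])
            free.append(budget - L)
    return packs

# Example: pack_signals([700, 300, 512, 212, 1024], budget=1024)
# -> [[4], [0, 1], [2, 3]]
```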

Authors:Byung-Joon Lee, Jin-Seop Lee, Jee-Hyong Lee
Title: Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction
Abstract:
Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at https://github.com/powerpowe/PAF-KIP-OSTTA .
中文摘要:本文提出的主辅助过滤机制和知识集成预测方法,通过提升数据筛选精度并融合多模型互补知识,有效解决了开放集测试时适应中的性能退化问题,在闭集精度和开放集识别方面均优于现有方法。
English Summary: This paper introduces Primary-Auxiliary Filtering (PAF) and Knowledge-Integrated Prediction (KIP) to improve open-set test-time adaptation by enhancing data filtering accuracy and integrating complementary knowledge from multiple models, achieving superior performance in both closed-set accuracy and open-set discrimination.
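
A hedged sketch of the two ideas: PAF flags a sample as open-set only when both a primary and an auxiliary filter agree, and KIP fuses the adapting, EMA, and source models' outputs. The entropy thresholds `t1`, `t2` and the equal fusion weights are placeholders, not the paper's calibrated values.

```python
import torch
import torch.nn.functional as F

def entropy(p):
    """Shannon entropy per sample; p: (B, C) probabilities."""
    return -(p * p.clamp_min(1e-12).log()).sum(-1)

def paf_open_set_mask(x, primary, auxiliary, t1, t2):
    """Flag samples as open-set only when both filters agree."""
    p1 = F.softmax(primary(x), dim=-1)
    p2 = F.softmax(auxiliary(x), dim=-1)
    return (entropy(p1) > t1) & (entropy(p2) > t2)

def kip_predict(x, adapting, ema, source, w=(1 / 3, 1 / 3, 1 / 3)):
    """Integrate complementary knowledge by averaging the models' outputs."""
    probs = [F.softmax(m(x), dim=-1) for m in (adapting, ema, source)]
    return sum(wi * pi for wi, pi in zip(w, probs))
```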

Authors:Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
Title: Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
Abstract:
Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
中文: Drawing2CAD提出了一种创新的序列到序列框架,通过变压器架构将二维工程图自动转换为参数化CAD模型,在弥补工业流程关键空白的同时保持了几何精度。
English: Drawing2CAD introduces a novel sequence-to-sequence framework that automatically converts 2D engineering drawings into parametric CAD models using a transformer architecture, preserving geometric precision while addressing a critical gap in industrial workflows.
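
The "soft target distribution loss" for CAD parameters can be sketched as cross-entropy against a smoothed, rather than one-hot, distribution over quantized parameter bins, so near-miss predictions are penalized less. The Gaussian smoothing below is an assumed form, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(param_logits, target_bins, n_bins, sigma=1.0):
    """param_logits: (B, n_bins) predictions over quantized parameter values;
    target_bins: (B,) integer ground-truth bin indices."""
    bins = torch.arange(n_bins, device=param_logits.device).float()
    dist = (bins[None, :] - target_bins[:, None].float()) ** 2
    soft = torch.softmax(-dist / (2 * sigma ** 2), dim=-1)   # smoothed target
    logp = F.log_softmax(param_logits, dim=-1)
    return -(soft * logp).sum(-1).mean()                     # soft cross-entropy
```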

Authors:Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao
Title: ROSE: Remove Objects with Side Effects in Videos
Abstract:
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object's effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on removing various side effects, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
中文摘要:ROSE框架通过合成数据和扩散变换器模型,能有效去除视频中的物体及其阴影、反射等副作用,性能优于现有方法。
English Summary: ROSE is a framework that removes objects and their side effects like shadows and reflections from videos using synthetic data and a diffusion transformer model, outperforming existing methods.
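
The differential-mask supervision mentioned above is easy to picture: the region affected by side effects is revealed by differencing the paired videos rendered with and without the object. A minimal sketch (the threshold value is illustrative):

```python
import torch

def differential_mask(video_with, video_without, thresh=0.05):
    """videos: (T, C, H, W) tensors in [0, 1], rendered with/without the object.
    Returns a (T, 1, H, W) binary mask of side-effect-affected areas."""
    diff = (video_with - video_without).abs().mean(dim=1, keepdim=True)
    return (diff > thresh).float()
```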

Authors:Xiaohao Sun, Divyam Goel, Angel X. Chang
Title: SemLayoutDiff: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis
Abstract:
We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation combining a top-down semantic map and attributes for each object. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model capable of conditioning scene synthesis explicitly on room masks. It first generates a coherent semantic map, followed by a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.

Authors:Ajinkya Khoche, Qingwen Zhang, Yixi Cai, Sina Sharif Mansouri, Patric Jensfelt
Title: DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance
Abstract:
Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/
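
The core cross-modal signal here is that a radar return's Doppler value measures radial velocity, so a candidate 3D motion for nearby LiDAR points can be checked against its projection onto the line of sight. The sketch below is a simplified illustration (nearest-point association, sensor at the origin), not DoGFlow's dynamic-aware association and ambiguity-resolved propagation.

```python
import numpy as np

def doppler_consistency(lidar_pts, flow, radar_pts, radar_doppler, dt):
    """Per-LiDAR-point residual between estimated flow and the Doppler
    pseudo-label of the nearest radar detection (~0 when consistent).
    lidar_pts: (N, 3), flow: (N, 3) m, radar_pts: (M, 3), radar_doppler: (M,) m/s."""
    vel = flow / dt                                        # (N, 3) velocity
    # Simplified association: nearest radar detection per LiDAR point.
    d = np.linalg.norm(lidar_pts[:, None] - radar_pts[None], axis=-1)
    idx = d.argmin(axis=1)
    # Line-of-sight unit vectors, assuming the sensor sits at the origin.
    los = radar_pts[idx] / np.linalg.norm(radar_pts[idx], axis=1, keepdims=True)
    radial = (vel * los).sum(axis=1)                       # projected radial speed
    return radial - radar_doppler[idx]
```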

Authors:Md. Rashid Shahriar Khan, Md. Abrar Hasan, Mohammod Tareq Aziz Justice
Title: Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling
Abstract:
Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-Zero-Shot-Anomaly-Detection-in-Surveillance.
中文: 本研究提出一种上下文感知的零样本异常检测框架,通过整合TimeSformer、DPC和CLIP模型,在不接触异常样本的情况下,利用时空建模与语义理解实现对监控视频中未知异常行为的识别。
English: This research presents a context-aware zero-shot anomaly detection framework that integrates TimeSformer, DPC, and CLIP to identify unseen abnormal behaviors in surveillance footage through spatiotemporal modeling and semantic understanding without prior anomaly exposure.
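
Since the components above are trained with InfoNCE-style contrastive alignment, the generic loss is worth stating. The batch-matched-positives form and the temperature below are the standard recipe, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    """queries, keys: (B, D) embeddings; matched rows are positives,
    all other rows in the batch act as negatives."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```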

Authors:Lucas Wojcik, Gabriel E. Lima, Valfride Nascimento, Eduil Nascimento, Rayson Laroca, David Menotti
Title: LPLC: A Dataset for License Plate Legibility Classification
Abstract:
Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.
Chinese: 自动车牌识别技术在处理模糊车牌时面临挑战,为此我们推出了一个新的可读性分类数据集,用于基准测试并推动进一步研究,因为当前模型的F1分数均低于80%。
English: Automatic License Plate Recognition struggles with illegible plates, prompting the creation of a new dataset for legibility classification to benchmark performance and encourage further research, as current models achieve under 80% F1 scores.

Authors:Haitang Feng, Jie Liu, Jie Tang, Gangshan Wu, Beiqi Chen, Jianhuang Lai, Guangcong Wang
Title: ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models
Abstract:
3D inpainting often relies on multi-view 2D image inpainting, where the inherent inconsistencies across different inpainted views can result in blurred textures, spatial discontinuities, and distracting visual artifacts. These inconsistencies pose significant challenges when striving for accurate and realistic 3D object completion, particularly in applications that demand high fidelity and structural coherence. To overcome these limitations, we propose ObjFiller-3D, a novel method designed for the completion and editing of high-quality and consistent 3D objects. Instead of employing a conventional 2D image inpainting model, our approach leverages a curated selection of state-of-the-art video editing models to fill in the masked regions of 3D objects. We analyze the representation gap between 3D and videos, and propose an adaptation of a video inpainting model for 3D scene inpainting. In addition, we introduce a reference-based 3D inpainting method to further enhance the quality of reconstruction. Experiments across diverse datasets show that compared to previous methods, ObjFiller-3D produces more faithful and fine-grained reconstructions (PSNR of 26.6 vs. NeRFiller (15.9) and LPIPS of 0.19 vs. Instant3dit (0.25)). Moreover, it demonstrates strong potential for practical deployment in real-world 3D editing applications. Project page: https://objfiller3d.github.io/ Code: https://github.com/objfiller3d/ObjFiller-3D .
中文: ObjFiller-3D通过将视频修复模型适配于三维场景,解决了多视角修复中的不一致性问题,实现了更高质量的重建效果和实际应用潜力。
English: ObjFiller-3D overcomes inconsistencies in multi-view 3D inpainting by adapting video editing models to fill masked regions, achieving superior reconstruction quality and practical applicability.

Authors:Ashwath Vaithinathan Aravindan, Abha Jha, Matthew Salaway, Atharva Sandeep Bhide, Duygu Nur Yaldiz
Title: Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation
Abstract:
Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well-explored, generative models lack effective mitigation techniques against such attacks. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting responses to poisoned prompts while maintaining image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers. Using the cross-attention mechanism, SKD-CAG neutralizes backdoor influences at the attention level, ensuring the targeted removal of adversarial effects. Extensive experiments show that our method outperforms existing approaches, achieving removal accuracy of 100\% for pixel backdoors and 93\% for style-based attacks, without sacrificing robustness or image fidelity. Our findings highlight targeted unlearning as a promising defense to secure generative models. Code and model weights can be found at https://github.com/Mystic-Slice/Sealing-The-Backdoor .
中文摘要:本文提出SKD-CAG方法,通过自知识蒸馏与交叉注意力引导,在保持图像生成质量的同时精准消除文本到图像扩散模型中的后门攻击,实现了对像素后门100%和风格攻击93%的清除准确率。
English Summary: This paper introduces SKD-CAG, a novel defense method that selectively erases backdoor triggers in text-to-image diffusion models through self-knowledge distillation and cross-attention guidance, achieving near-perfect attack removal while preserving image quality.

Authors:Ayce Idil Aytekin, Helge Rhodin, Rishabh Dabral, Christian Theobalt
Title: Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance
Abstract:
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.
中文: 我们提出了一种基于扩散模型的新方法,通过融合手-物体交互线索和多模态几何约束,从单张RGB图像中直接生成高质量的三维物体几何重建。
English: Our method introduces a diffusion-based framework that reconstructs high-quality 3D object geometry from single RGB images by integrating hand-object interaction cues and multi-modal supervision during the diffusion process.

Authors:Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer
Title: SpotEdit: Evaluating Visually-Guided Image Editing Methods
Abstract:
Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.
中文摘要:SpotEdit是一个全面基准,系统评估了多种生成模型在视觉引导图像编辑中的表现,揭示了显著的性能差异,并重点解决了GPT-4o等领先模型常出现的幻觉问题——即错误感知视觉提示并执行编辑任务。
English Summary: SpotEdit is a comprehensive benchmark that systematically evaluates visually-guided image editing methods across various generative models, revealing significant performance gaps and addressing the critical issue of hallucination where models like GPT-4o falsely perceive visual cues.

Authors:Chun Liu, Chen Zhang, Zhuo Li, Zheng Li, Wei Yang
Title: Few-shot Unknown Class Discovery of Hyperspectral Images with Prototype Learning and Clustering
Abstract:
Open-set few-shot hyperspectral image (HSI) classification aims to classify image pixels by using few labeled pixels per class, where the pixels to be classified may not all come from classes that have been seen. To address the open-set HSI classification challenge, current methods focus mainly on distinguishing the unknown class samples from the known class samples and rejecting them to increase the accuracy of identifying known class samples. They fail to further identify or discover the unknown classes among the samples. This paper proposes a prototype learning and clustering method for discovering unknown classes in HSIs under the few-shot setting. Using few labeled samples, it strives to develop the ability to infer the prototypes of unknown classes while distinguishing unknown classes from known classes. Once the unknown class samples are rejected by the learned known class classifier, the proposed method can further cluster the unknown class samples into different classes according to their distance to the inferred unknown class prototypes. Compared to existing state-of-the-art methods, extensive experiments on four benchmark HSI datasets demonstrate that our proposed method exhibits competitive performance in open-set few-shot HSI classification tasks. All the codes are available at \href{https://github.com/KOBEN-ff/OpenFUCD-main} {https://github.com/KOBEN-ff/OpenFUCD-main}
Chinese: 本文提出了一种用于开放集少样本高光谱图像分类的原型学习与聚类方法,不仅能区分已知与未知类别,还能通过推断未知类原型对未知样本进行识别和聚类,在基准数据集上展现出优越性能。
English: This paper introduces a prototype learning and clustering method for open-set few-shot hyperspectral image classification, which not only distinguishes known from unknown classes but also identifies and clusters unknown class samples by inferring their prototypes, demonstrating competitive performance on benchmark datasets.
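
The final clustering step described above reduces to nearest-prototype assignment for the rejected samples; a minimal sketch follows (Euclidean distance is an assumption).

```python
import torch

def assign_to_unknown_prototypes(rejected_feats, unknown_protos):
    """rejected_feats: (N, D) features rejected by the known-class classifier;
    unknown_protos: (K, D) inferred unknown-class prototypes.
    Returns the cluster index per sample."""
    d = torch.cdist(rejected_feats, unknown_protos)   # (N, K) distances
    return d.argmin(dim=1)
```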

Authors:Kaiyu Li, Xiangyong Cao, Ruixun Liu, Shihong Wang, Zixuan Jiang, Zhi Wang, Deyu Meng
Title: Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images
Abstract:
Semantic segmentation of remote sensing (RS) images is pivotal for comprehensive Earth observation, but the demand for interpreting new object categories, coupled with the high expense of manual annotation, poses significant challenges. Although open-vocabulary semantic segmentation (OVSS) offers a promising solution, existing frameworks designed for natural images are insufficient for the unique complexities of RS data. They struggle with vast scale variations and fine-grained details, and their adaptation often relies on extensive, costly annotations. To address this critical gap, this paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. Specifically, we propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features, correcting distorted target shapes without any task-specific post-training. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features, significantly enhancing local semantic fidelity. These components empower SegEarth-OV to effectively harness the rich semantics of pre-trained VLMs, making OVSS possible in optical RS contexts. Furthermore, to extend the framework's universality to other challenging RS modalities like SAR images, where large-scale VLMs are unavailable and expensive to create, we introduce AlignEarth, a distillation-based strategy that efficiently transfers semantic knowledge from an optical VLM encoder to a SAR encoder, bypassing the need to build SAR foundation models from scratch and enabling universal OVSS across diverse sensor types. Extensive experiments on both optical and SAR datasets validate that SegEarth-OV can achieve dramatic improvements over the SOTA methods, establishing a robust foundation for annotation-free and open-world Earth observation.
中文摘要:本文提出首个用于遥感图像免标注开放词汇分割的SegEarth-OV框架,通过创新的特征上采样和全局偏差消除技术突破现有方法局限,在光学与SAR数据集上均实现了最先进的性能表现。
English Summary: This paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of remote sensing images, which overcomes limitations of existing methods through innovative feature upsampling and global bias alleviation techniques, achieving state-of-the-art performance across optical and SAR datasets.

Authors:Pengfei Jiang, Hanjun Li, Linglan Zhao, Fei Chao, Ke Yan, Shouhong Ding, Rongrong Ji
Title: VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
Abstract:
In this study, we introduce a novel method called group-wise \textbf{VI}sual token \textbf{S}election and \textbf{A}ggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). Compared with previous token pruning approaches, our method can preserve more visual information while compressing visual tokens. We first propose a graph-based visual token aggregation (VTA) module. VTA treats each visual token as a node, forming a graph based on semantic similarity among visual tokens. It then aggregates information from removed tokens into kept tokens based on this graph, producing a more compact visual token representation. Additionally, we introduce a group-wise token selection strategy (GTS) to divide visual tokens into kept and removed ones, guided by text tokens from the final layers of each group. This strategy progressively aggregates visual information, enhancing the stability of the visual information extraction process. We conduct comprehensive experiments on LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA across various benchmarks to validate the efficacy of VISA. Our method consistently outperforms previous methods, achieving a superior trade-off between model performance and inference speed. The code is available at https://github.com/mobiushy/VISA.
中文摘要:本研究提出VISA新方法,通过分组选择和基于图的聚合策略压缩视觉令牌,在保持更多视觉信息的同时提升多模态大语言模型的推理效率与性能平衡。
English Summary: This study introduces VISA, a novel method that enhances multimodal large language models by efficiently compressing visual tokens through group-wise selection and graph-based aggregation, achieving superior performance and faster inference speeds.
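
The graph-based aggregation can be sketched as follows: removed tokens pass their content to semantically similar kept tokens along softmax-weighted edges, so compression preserves more information than pruning alone. The edge weighting below is an assumed form, not VISA's exact rule.

```python
import torch
import torch.nn.functional as F

def aggregate_tokens(kept, removed, tau=1.0):
    """kept: (K, D), removed: (R, D) visual token features.
    Each removed token distributes its feature to its most similar kept tokens."""
    sim = F.normalize(removed, dim=-1) @ F.normalize(kept, dim=-1).t()  # (R, K)
    w = torch.softmax(sim / tau, dim=-1)        # edge weights, removed -> kept
    return kept + w.t() @ removed               # fold removed info into kept
```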

Authors:Weiqi Yan, Lvhai Chen, Shengchuan Zhang, Yan Zhang, Liujuan Cao
Title: SCOUT: Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection
Abstract:
The difficulty of pixel-level annotation has significantly hindered the development of the Camouflaged Object Detection (COD) field. To save on annotation costs, previous works leverage the semi-supervised COD framework that relies on a small number of labeled data and a large volume of unlabeled data. We argue that there is still significant room for improvement in the effective utilization of unlabeled data. To this end, we introduce a Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection (SCOUT). It includes an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM). The ADAS module selects valuable data for annotation through an adversarial augment and sampling strategy. The TFM module further leverages the selected valuable data by combining camouflage-related knowledge and text-visual interaction. To adapt to this work, we build a new dataset, namely RefTextCOD. Extensive experiments show that the proposed method surpasses previous semi-supervised methods in the COD field and achieves state-of-the-art performance. Our code will be released at https://github.com/Heartfirey/SCOUT.
Chinese: 提出的SCOUT框架通过对抗性增强自适应选择有价值的未标记数据,并结合文本-视觉交互,在半监督伪装目标检测中实现了最先进的性能,并在新构建的数据集上得到验证。
English: The proposed SCOUT framework enhances semi-supervised camouflaged object detection by adaptively selecting valuable unlabeled data through adversarial augmentation and integrating text-visual interactions, achieving state-of-the-art performance on a newly built dataset.

Authors:Meiqi Gong, Hao Zhang, Xunpeng Yi, Linfeng Tang, Jiayi Ma
Title: TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration
Abstract:
Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and leading to inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with Dinov2 and VGG19 employed for targeted distillation, allowing simultaneous enhancement of both the visual and semantic representations. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline by constructing a temporal cooperative module, which leverages temporal dependencies to facilitate weak information recovery. Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two innovative evaluation metrics tailored for video fusion, aimed at assessing the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets demonstrate the superiority of our method. Our code is released at https://github.com/Meiqi-Gong/TemCoCo.
中文摘要:本文提出首个融合时序建模与视觉语义协作的视频融合框架,通过视觉语义交互模块、时序协同模块和时序增强机制,确保视频帧间的视觉保真度、语义准确性和时序一致性,并在公开数据集上验证了其优越性。
English Summary: This paper introduces the first video fusion framework that integrates temporal modeling with visual-semantic collaboration to achieve consistent, high-quality results across video frames, validated by new evaluation metrics and superior performance on public datasets.

Authors:Xingyu Ai, Shaoyu Wang, Zhiyuan Jia, Ao Xu, Hongming Shan, Jianhua Ma, Qiegen Liu
Title: UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization
Abstract:
During raw-data acquisition in CT imaging, diverse factors can degrade the collected sinograms, with undersampling and noise leading to severe artifacts and noise in reconstructed images and compromising diagnostic accuracy. Conventional correction methods rely on manually designed algorithms or fixed empirical parameters, but these approaches often lack generalizability across heterogeneous artifact types. To address these limitations, we propose UniSino, a foundation model for universal CT sinogram standardization. Unlike existing foundational models that operate in the image domain, UniSino directly standardizes data in the projection domain, which enables stronger generalization across diverse undersampling scenarios. Its training framework incorporates the physical characteristics of sinograms, enhancing generalization and enabling robust performance across multiple subtasks spanning four benchmark datasets. Experimental results demonstrate that UniSino achieves superior reconstruction quality in both single and mixed undersampling cases, demonstrating exceptional robustness and generalization in sinogram enhancement for CT imaging. The code is available at: https://github.com/yqx7150/UniSino.
中文: UniSino是一种通用的CT正弦图基础模型,直接在投影域中标准化数据,能在多种欠采样场景下提升泛化能力和重建质量。
English: UniSino is a universal CT sinogram foundation model that directly standardizes projection data, enhancing generalization and reconstruction quality across diverse undersampling scenarios.

Authors:Hanzhi Chang, Ruijie Zhu, Wenjie Chang, Mulin Yu, Yanzhe Liang, Jiahao Lu, Zhuoyuan Li, Tianzhu Zhang
Title: MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting
Abstract:
Surface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvements, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks. Project Page: https://hanzhichang.github.io/meshsplat_web
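
The Weighted Chamfer Distance Loss mentioned above can be sketched as a standard two-sided Chamfer term with per-point weights (e.g., emphasizing view-overlap regions) supplied by the caller. The exact weighting scheme is the paper's; this form is an assumption.

```python
import torch

def weighted_chamfer(a, b, w_a=None, w_b=None):
    """a: (N, 3), b: (M, 3) point sets; w_a, w_b: optional per-point weights
    (e.g., higher in overlapping view regions)."""
    d = torch.cdist(a, b)                    # (N, M) pairwise distances
    ab = d.min(dim=1).values                 # a -> nearest in b
    ba = d.min(dim=0).values                 # b -> nearest in a
    if w_a is not None:
        ab = ab * w_a
    if w_b is not None:
        ba = ba * w_b
    return ab.mean() + ba.mean()
```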

Authors:Toufiq Musah, Chinasa Kalaiwo, Maimoona Akram, Ubaida Napari Abdulai, Maruf Adewole, Farouk Dako, Adaobi Chiazor Emegoakor, Udunna C. Anazodo, Prince Ebenezer Adjei, Confidence Raymond
Title: Towards Trustworthy Breast Tumor Segmentation in Ultrasound using Monte Carlo Dropout and Deep Ensembles for Epistemic Uncertainty Estimation
Abstract:
Automated segmentation of breast ultrasound (BUS) images is important for precise lesion delineation and tumor characterization, but is challenged by inherent artifacts and dataset inconsistencies. In this work, we evaluate the use of a modified Residual Encoder U-Net for breast ultrasound segmentation, with a focus on uncertainty quantification. We identify and correct for data duplication in the BUSI dataset, and use a deduplicated subset for more reliable estimates of generalization performance. Epistemic uncertainty is quantified using Monte Carlo dropout, deep ensembles, and their combination. Models are benchmarked on both in-distribution and out-of-distribution datasets to demonstrate how they generalize to unseen cross-domain data. Our approach achieves state-of-the-art segmentation accuracy on the Breast-Lesion-USG dataset with in-distribution validation, and provides calibrated uncertainty estimates that effectively signal regions of low model confidence. Performance declines and increased uncertainty observed in out-of-distribution evaluation highlight the persistent challenge of domain shift in medical imaging, and the importance of integrated uncertainty modeling for trustworthy clinical deployment. Code available at: https://github.com/toufiqmusah/nn-uncertainty.git
中文: 本研究采用改进的残差编码器U-Net进行乳腺超声图像分割,通过去重数据和跨域测试实现了最优分割精度与可靠的不确定性量化,同时揭示了医学影像中域适应问题的持续挑战。
English: This study introduces a modified Residual Encoder U-Net for breast ultrasound segmentation, achieving state-of-the-art accuracy with calibrated uncertainty estimates while highlighting domain shift challenges through rigorous evaluation on deduplicated and cross-domain datasets.
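
The two epistemic-uncertainty estimators evaluated above are standard, and a minimal PyTorch sketch follows: MC dropout keeps dropout stochastic at test time, a deep ensemble averages independently trained models, and per-pixel variance flags low-confidence regions. `T` and the sigmoid (binary mask) head are assumptions.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model):
    """Keep dropout stochastic at test time without touching BatchNorm."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()

def mc_dropout_predict(model, x, T=20):
    """T stochastic forward passes; returns mean mask prob and variance."""
    enable_mc_dropout(model)
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(T)])
    return probs.mean(0), probs.var(0)

def ensemble_predict(models, x):
    """Average independently trained models; variance is epistemic uncertainty."""
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(m(x)) for m in models])
    return probs.mean(0), probs.var(0)
```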

Authors:Seo-Bin Hwang, Yeong-Jun Cho
Title: DroneKey: Drone 3D Pose Estimation in Image Sequences using Gated Key-representation and Pose-adaptive Learning
Abstract:
Estimating the 3D pose of a drone is important for anti-drone systems, but existing methods struggle with the unique challenges of drone keypoint detection. Drone propellers serve as keypoints but are difficult to detect due to their high visual similarity and diversity of poses. To address these challenges, we propose DroneKey, a framework that combines a 2D keypoint detector and a 3D pose estimator specifically designed for drones. In the keypoint detection stage, we extract two key-representations (intermediate and compact) from each transformer encoder layer and optimally combine them using a gated sum. We also introduce a pose-adaptive Mahalanobis distance in the loss function to ensure stable keypoint predictions across extreme poses. We built new datasets of drone 2D keypoints and 3D pose to train and evaluate our method, which have been publicly released. Experiments show that our method achieves an AP of 99.68% (OKS) in keypoint detection, outperforming existing methods. Ablation studies confirm that the pose-adaptive Mahalanobis loss function improves keypoint prediction stability and accuracy. Additionally, improvements in the encoder design enable real-time processing at 44 FPS. For 3D pose estimation, our method achieved an MAE-angle of 10.62°, an RMSE of 0.221m, and an MAE-absolute of 0.076m, demonstrating high accuracy and reliability. The code and dataset are available at https://github.com/kkanuseobin/DroneKey.
Chinese: 提出的DroneKey框架通过结合基于Transformer的2D关键点检测器和3D姿态估计器,解决了无人机姿态估计的挑战,在关键点检测(99.68% AP)和3D姿态估计方面均达到最先进精度,同时实现44 FPS的实时处理能力。
English: The proposed DroneKey framework addresses drone pose estimation challenges by combining a transformer-based 2D keypoint detector with a 3D pose estimator, achieving state-of-the-art accuracy in both keypoint detection (99.68% AP) and 3D pose estimation while enabling real-time processing at 44 FPS.
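
The gated sum combining the intermediate and compact key-representations can be sketched as a learned convex combination per feature; the gate network below is a plausible form, not necessarily the paper's exact module.

```python
import torch
import torch.nn as nn

class GatedSum(nn.Module):
    """Learned per-feature blend of two representations of the same dim."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, intermediate, compact):
        g = torch.sigmoid(self.gate(torch.cat([intermediate, compact], -1)))
        return g * intermediate + (1 - g) * compact
```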

Authors:Yaolei Qi, Yikai Yang, Wenbo Peng, Shumei Miao, Yutao Hu, Guanyu Yang
Title: Rethinking the Detail-Preserved Completion of Complex Tubular Structures based on Point Cloud: a Dataset and a Benchmark
Abstract:
Complex tubular structures are essential in medical imaging and computer-assisted diagnosis, where their integrity enhances anatomical visualization and lesion detection. However, existing segmentation algorithms struggle with structural discontinuities, particularly in severe clinical cases such as coronary artery stenosis and vessel occlusions, which leads to undesired discontinuity and compromising downstream diagnostic accuracy. Therefore, it is imperative to reconnect discontinuous structures to ensure their completeness. In this study, we explore the tubular structure completion based on point cloud for the first time and establish a Point Cloud-based Coronary Artery Completion (PC-CAC) dataset, which is derived from real clinical data. This dataset provides a novel benchmark for tubular structure completion. Additionally, we propose TSRNet, a Tubular Structure Reconnection Network that integrates a detail-preservated feature extractor, a multiple dense refinement strategy, and a global-to-local loss function to ensure accurate reconnection while maintaining structural integrity. Comprehensive experiments on our PC-CAC and two additional public datasets (PC-ImageCAS and PC-PTR) demonstrate that our method consistently outperforms state-of-the-art approaches across multiple evaluation metrics, setting a new benchmark for point cloud-based tubular structure reconstruction. Our benchmark is available at https://github.com/YaoleiQi/PCCAC.
中文摘要:本研究首次基于点云探索管状结构重建,提出了TSRNet网络和PC-CAC数据集,通过保留细节的特征提取和多层次优化策略,显著提升了血管等管状结构的连续性与完整性。
English Summary: This study introduces TSRNet, a novel network for reconnecting discontinuous tubular structures in medical imaging using point clouds, and establishes the PC-CAC benchmark dataset to enhance segmentation accuracy and diagnostic reliability.

Authors:Krishna Vinod, Prithvi Jai Ramesh, Pavan Kumar B N, Bharatesh Chakravarthi
Title: SEBVS: Synthetic Event-based Visual Servoing for Robot Navigation and Manipulation
Abstract:
Event cameras offer microsecond latency, high dynamic range, and low power consumption, making them ideal for real-time robotic perception under challenging conditions such as motion blur, occlusion, and illumination changes. However, despite their advantages, synthetic event-based vision remains largely unexplored in mainstream robotics simulators. The absence of such simulation support hinders the evaluation of event-driven approaches for robotic manipulation and navigation tasks. This work presents an open-source, user-friendly v2e robotics operating system (ROS) package for Gazebo simulation that enables seamless event stream generation from RGB camera feeds. The package is used to investigate event-based robotic policies (ERP) for real-time navigation and manipulation. Two representative scenarios are evaluated: (1) object following with a mobile robot and (2) object detection and grasping with a robotic manipulator. Transformer-based ERPs are trained by behavior cloning and compared to RGB-based counterparts under various operating conditions. Experimental results show that event-guided policies consistently deliver competitive advantages. The results highlight the potential of event-driven perception to improve real-time robotic navigation and manipulation, providing a foundation for broader integration of event cameras into robotic policy learning. The GitHub repo for the dataset and code: https://eventbasedvision.github.io/SEBVS/

Authors:Jonathan P. Crall, Charles V. Stewart, Tanya Y. Berger-Wolf, Daniel I. Rubenstein, Siva R. Sundaresan
Title: HotSpotter - Patterned Species Instance Recognition
Abstract:
We present HotSpotter, a fast, accurate algorithm for identifying individual animals against a labeled database. It is not species specific and has been applied to Grevy's and plains zebras, giraffes, leopards, and lionfish. We describe two approaches, both based on extracting and matching keypoints or "hotspots". The first tests each new query image sequentially against each database image, generating a score for each database image in isolation, and ranking the results. The second, building on recent techniques for instance recognition, matches the query image against the database using a fast nearest neighbor search. It uses a competitive scoring mechanism derived from the Local Naive Bayes Nearest Neighbor algorithm recently proposed for category recognition. We demonstrate results on databases of more than 1000 images, producing more accurate matches than published methods and matching each query image in just a few seconds.
Chinese: HotSpotter是一种快速、跨物种的算法,通过基于热点特征的匹配方法,能在数秒内从大型数据库中精确识别个体动物,其准确性优于现有技术。
English: HotSpotter is a fast, cross-species algorithm that uses hotspot-based matching to accurately identify individual animals from large databases in seconds, outperforming existing methods.
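
The competitive scoring HotSpotter adapts from Local Naive Bayes Nearest Neighbor can be sketched as follows: each query descriptor votes for the images owning its k nearest database descriptors, scored by the margin over a background (k+1-th neighbor) distance. This simplified version uses scikit-learn's NearestNeighbors for the search; the paper's implementation differs in detail.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lnbnn_scores(query_desc, db_desc, db_image_ids, k=5):
    """query_desc: (Q, D) query keypoint descriptors; db_desc: (N, D) database
    descriptors; db_image_ids[i] = image owning db descriptor i.
    Returns {image_id: score}; higher scores indicate better matches."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(db_desc)
    dist, idx = nn.kneighbors(query_desc)          # (Q, k+1)
    scores = {}
    for d_row, i_row in zip(dist, idx):
        background = d_row[-1]                     # (k+1)-th neighbor distance
        for d, i in zip(d_row[:-1], i_row[:-1]):
            img = db_image_ids[i]
            scores[img] = scores.get(img, 0.0) + (background - d)
    return scores
```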

Authors:Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani
Title: Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
Abstract:
Human social behaviors are inherently multimodal necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating the model on different social and affective downstream tasks, namely, emotion recognition, laughter detection and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
Chinese: 本文提出了Social-MAE模型,这是一种基于自监督学习在社交数据上预训练的先进视听模型,在情感识别和笑声检测任务中达到最优效果,并在性格评估中取得竞争性表现。
English: The paper introduces Social-MAE, an advanced audiovisual model pre-trained on social interaction data using self-supervised learning, which achieves state-of-the-art results in emotion and laughter recognition and competitive performance in personality estimation.

Authors:Zhiwen Chen, Jinjian Wu, Zhiyu Zhu, Yifan Zhang, Guangming Shi, Junhui Hou
Title: Optimizing Multi-Modal Trackers via Sensitivity-aware Regularized Tuning
Abstract:
This paper tackles the critical challenge of optimizing multi-modal trackers by effectively adapting the pre-trained models for RGB data. Existing fine-tuning paradigms oscillate between excessive freedom and over-restriction, both leading to a suboptimal plasticity-stability trade-off. To mitigate this dilemma, we propose a novel sensitivity-aware regularized tuning framework, which delicately refines the learning process by incorporating intrinsic parameter sensitivities. Through a comprehensive investigation from pre-trained to multi-modal contexts, we identify that parameters sensitive to pivotal foundational patterns and cross-domain shifts are primary drivers of this issue. Specifically, we first analyze the tangent space of pre-trained weights to measure and orient prior sensitivities, dedicated to preserving generalization. Then, we further explore transfer sensitivities during the tuning phase, emphasizing adaptability and stability. By incorporating these sensitivities as regularization terms, our method significantly enhances the transferability across modalities. Extensive experiments showcase the superior performance of the proposed method, surpassing current state-of-the-art techniques across various multi-modal tracking benchmarks. The source code and models will be publicly available at https://github.com/zhiwen-xdu/SRTrack.
Chinese: 本文提出了一种灵敏度感知的正则化调优框架,通过平衡参数的可塑性与稳定性来优化多模态跟踪器,在各种跟踪场景中实现了最先进的性能。
English: This paper introduces a sensitivity-aware regularized tuning framework that optimizes multi-modal trackers by balancing parameter plasticity and stability, achieving state-of-the-art performance across various tracking scenarios.

Authors:Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, Xihui Liu
Title: T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation
Abstract:
We propose T2I-ReasonBench, a benchmark evaluating reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess the reasoning accuracy and image quality. We benchmark various T2I generation models, and provide comprehensive analysis on their performances.
中文: T2I-ReasonBench 是一个评估文本到图像模型推理能力的基准,涵盖成语解释、文本图像设计、实体推理和科学推理四个维度,采用两阶段评估协议检验准确性和图像质量,并对多种模型进行了全面性能分析。
English: T2I-ReasonBench is a benchmark designed to evaluate text-to-image models' reasoning abilities across four dimensions—idiom interpretation, textual image design, entity-reasoning, and scientific-reasoning—using a two-stage protocol to assess accuracy and image quality, with comprehensive performance analysis of various models.

Authors:Zijing Zhao, Zhu Xu, Qingchao Chen, Yuxin Peng, Yang Liu
Title: Investigating Domain Gaps for Indoor 3D Object Detection
Abstract:
As a fundamental task for indoor scene understanding, 3D object detection has been extensively studied, and the accuracy on indoor point cloud data has been substantially improved. However, existing research has been conducted on limited datasets, where the training and testing sets share the same distribution. In this paper, we consider the task of adapting indoor 3D object detectors from one dataset to another, presenting a comprehensive benchmark with ScanNet, SUN RGB-D and 3D Front datasets, as well as our newly proposed large-scale datasets ProcTHOR-OD and ProcFront generated by a 3D simulator. Since indoor point cloud datasets are collected and constructed in different ways, the object detectors are likely to overfit to specific factors within each dataset, such as point cloud quality, bounding box layout and instance features. We conduct experiments across datasets on different adaptation scenarios including synthetic-to-real adaptation, point cloud quality adaptation, layout adaptation and instance feature adaptation, analyzing the impact of different domain gaps on 3D object detectors. We also introduce several approaches to improve adaptation performance, providing baselines for domain adaptive indoor 3D object detection, hoping that future works may propose detectors with stronger generalization ability across domains. Our project homepage can be found in https://jeremyzhao1998.github.io/DAVoteNet-release/.
Chinese: 本文提出了一个全面的领域自适应室内3D物体检测基准,通过跨数据集实验评估不同领域差异对检测器的影响,并提供了提升跨领域泛化能力的基线方法。
English: This paper introduces a comprehensive benchmark for domain adaptive indoor 3D object detection, addressing the challenge of overfitting to specific dataset characteristics by evaluating across multiple datasets and proposing methods to enhance cross-domain generalization.

Authors:Bin Huang, Zhong Liu, Huiying Wen, Bingsheng Huang, Xin Chen, Shuo Li
Title: E-BayesSAM: Efficient Bayesian Adaptation of SAM with Self-Optimizing KAN-Based Interpretation for Uncertainty-Aware Ultrasonic Segmentation
Abstract:
Although the Segment Anything Model (SAM) has advanced medical image segmentation, its Bayesian adaptation for uncertainty-aware segmentation remains hindered by three key issues: (1) instability in Bayesian fine-tuning of large pre-trained SAMs; (2) high computation cost due to SAM's massive parameters; (3) SAM's black-box design limits interpretability. To overcome these, we propose E-BayesSAM, an efficient framework combining Token-wise Variational Bayesian Inference (T-VBI) for efficient Bayesian adaptation and Self-Optimizing Kolmogorov-Arnold Network (SO-KAN) for improving interpretability. T-VBI innovatively reinterprets SAM's output tokens as dynamic probabilistic weights and reparameterizes them as latent variables without auxiliary training, enabling training-free VBI for uncertainty estimation. SO-KAN improves token prediction with learnable spline activations via self-supervised learning, providing insight to prune redundant tokens to boost efficiency and accuracy. Experiments on five ultrasound datasets demonstrated that E-BayesSAM achieves: (i) real-time inference (0.03s/image), (ii) superior segmentation accuracy (average DSC: Pruned E-BayesSAM's 89.0% vs. E-BayesSAM's 88.0% vs. MedSAM's 88.3%), and (iii) identification of four critical tokens governing SAM's decisions. By unifying efficiency, reliability, and interpretability, E-BayesSAM bridges SAM's versatility with clinical needs, advancing deployment in safety-critical medical applications. The source code is available at https://github.com/mp31192/E-BayesSAM.
中文:E-BayesSAM通过结合令牌变分贝叶斯推理进行不确定性估计和自优化柯尔莫哥洛夫-阿诺德网络提升可解释性,克服了SAM贝叶斯适应的局限性,在医学影像中实现了实时推理、更优的准确性及关键决策令牌识别。
English: E-BayesSAM overcomes SAM's Bayesian adaptation limitations by integrating Token-wise Variational Bayesian Inference for uncertainty estimation and a Self-Optimizing Kolmogorov-Arnold Network for enhanced interpretability, achieving real-time inference, superior accuracy, and identification of critical decision-making tokens in medical imaging.
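As a rough illustration of the token-wise reparameterization idea, the sketch below treats an output token as a Gaussian latent variable, samples it with the reparameterization trick, and turns repeated decodings into a per-pixel uncertainty map. The decode callable, shapes, and sample count are assumptions for illustration, not the E-BayesSAM implementation.

    import torch

    def sample_token(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
        # Reparameterized draw: token = mu + sigma * eps, differentiable in mu and logvar.
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def mc_uncertainty(decode, mu, logvar, n_samples=8):
        # decode: hypothetical callable mapping a token to an (H, W) probability map.
        probs = torch.stack([decode(sample_token(mu, logvar)) for _ in range(n_samples)])
        return probs.mean(dim=0), probs.std(dim=0)  # mean mask and pixel-wise uncertainty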

Authors:Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang
Title: MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling
Abstract:
Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.
中文:现有视频生成模型难以合成复杂人体运动,因此我们提出MoCo框架,通过三维结构生成器与人体感知动态控制模块将结构生成与外观生成解耦,有效提升运动真实性和结构连贯性,显著优于现有方法。
English: Current video generation models struggle with realistic human motion synthesis, so we introduce MoCo, a framework that decouples structure and appearance generation, using a 3D structure generator and Human-Aware Dynamic Control modules to produce structurally coherent videos with complex movements, outperforming existing methods.

Authors:Bokai Zhao, Weiyang Shi, Hanqing Chao, Zijiang Yang, Yiyang Zhang, Ming Song, Tianzi Jiang
Title: Neural Proteomics Fields for Super-resolved Spatial Proteomics Prediction
Abstract:
Spatial proteomics maps protein distributions in tissues, providing transformative insights for life sciences. However, current sequencing-based technologies suffer from low spatial resolution, and substantial inter-tissue variability in protein expression further compromises the performance of existing molecular data prediction methods. In this work, we introduce the novel task of spatial super-resolution for sequencing-based spatial proteomics (seq-SP) and, to the best of our knowledge, propose the first deep learning model for this task--Neural Proteomics Fields (NPF). NPF formulates seq-SP as a protein reconstruction problem in continuous space by training a dedicated network for each tissue. The model comprises a Spatial Modeling Module, which learns tissue-specific protein spatial distributions, and a Morphology Modeling Module, which extracts tissue-specific morphological features. Furthermore, to facilitate rigorous evaluation, we establish an open-source benchmark dataset, Pseudo-Visium SP, for this task. Experimental results demonstrate that NPF achieves state-of-the-art performance with fewer learnable parameters, underscoring its potential for advancing spatial proteomics research. Our code and dataset are publicly available at https://github.com/Bokai-Zhao/NPF.
Chinese: 本文提出了Neural Proteomics Fields (NPF)这一新型深度学习模型,通过分别学习组织特异性蛋白质空间分布和形态特征,解决了测序空间蛋白质组学中分辨率低和表达变异大的问题,以更少参数实现了最优性能。
English: This paper introduces Neural Proteomics Fields (NPF), a novel deep learning model that addresses the low spatial resolution and variability challenges in sequencing-based spatial proteomics by learning tissue-specific protein distributions and morphological features, achieving state-of-the-art performance with fewer parameters.

Authors:Guoqing Zhang, Xingtong Ge, Lu Shi, Xin Zhang, Muqing Xue, Wanru Xu, Yigang Cen
Title: Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation
Abstract:
The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.
中文: 提出的UniGen框架通过CoMoE模块和WeaveNet机制统一多种条件输入进行图像生成,有效减少冗余并提升效率,在多项任务中实现了最优性能。
English: The proposed UniGen framework introduces the CoMoE module and WeaveNet mechanism to unify diverse conditional inputs for image generation, effectively reducing redundancy and improving efficiency while achieving state-of-the-art performance across multiple tasks.

Authors:Hengyuan Zhang, Zhe Li, Xingqun Qi, Mengze Li, Muyi Sun, Man Zhang, Sirui Han
Title: DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions
Abstract:
Generating coherent and diverse human dances from music signals has seen tremendous progress in animating virtual avatars. While existing methods support direct dance synthesis, they fail to recognize that enabling users to edit dance movements is far more practical in real-world choreography scenarios. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits addressing this challenge. To achieve this goal, we first construct DanceRemix, a large-scale multi-turn editable dance dataset comprising over 25.3M dance frames and 84.5K pairs. In addition, we propose a novel framework for iterative and editable dance generation coherently aligned with given music signals, namely DanceEditor. Considering that the dance motion should be both musically rhythmic and iteratively editable through user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the authority of generated results by directly modeling dance movements from tailored, aligned music. Moreover, at the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to draw the editable results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. Thereby, the results display music harmonics while preserving fine-grained semantic alignment with text descriptions. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected DanceRemix dataset. Code is available at https://lzvsdy.github.io/DanceEditor/.

Authors:Tristan S. W. Stevens, Oisín Nolan, Ruud J. G. van Sloun
Title: Semantic Diffusion Posterior Sampling for Cardiac Ultrasound Dehazing
Abstract:
Echocardiography plays a central role in cardiac imaging, offering dynamic views of the heart that are essential for diagnosis and monitoring. However, image quality can be significantly degraded by haze arising from multipath reverberations, particularly in difficult-to-image patients. In this work, we propose a semantic-guided, diffusion-based dehazing algorithm developed for the MICCAI Dehazing Echocardiography Challenge (DehazingEcho2025). Our method integrates a pixel-wise noise model, derived from semantic segmentation of hazy inputs, into a diffusion posterior sampling framework guided by a generative prior trained on clean ultrasound data. Quantitative evaluation on the challenge dataset demonstrates strong performance across contrast and fidelity metrics. Code for the submitted algorithm is available at https://github.com/tristan-deep/semantic-diffusion-echo-dehazing.
中文: 本文提出了一种基于语义引导扩散的去雾算法,通过将逐像素噪声模型与生成先验相结合,有效消除超声心动图中的雾状伪影,定量评估显示其性能优异。
English: This paper introduces a semantic-guided diffusion-based algorithm that effectively removes haze from echocardiographic images by integrating a pixel-wise noise model with generative priors, demonstrating strong performance in quantitative evaluations.

Authors:Songliang Cao, Tianqi Hu, Hao Lu
Title: First Place Solution to the MLCAS 2025 GWFSS Challenge: The Devil is in the Detail and Minority
Abstract:
In this report, we present our solution for the MLCAS 2025 GWFSS Challenge. This challenge hosts a semantic segmentation competition specific to wheat plants, which requires segmenting three wheat organs, including the head, leaf, and stem, plus a background class. In 2025, participating in a segmentation competition is significantly different from previous years, where many tricks could play important roles. Nowadays most segmentation tricks have been well integrated into existing codebases, such that our naive ViT-Adapter baseline has already achieved sufficiently good performance. Hence, we believe the key to standing out among other competitors is to focus on the problem nature of wheat per se. By probing visualizations, we identify the key -- the stem matters. In contrast to heads and leaves, stems exhibit fine structure and occupy only a few pixels, which leads to fragile predictions and class imbalance. Building on our baseline, we present three technical improvements tailored to stems: i) incorporating a dynamic upsampler, SAPA, to enhance detail delineation; ii) leveraging semi-supervised guided distillation with stem-aware sample selection to mine the treasure beneath unlabeled data; and iii) applying a test-time scaling strategy to zoom in and segment the image twice. Despite being simple, the three improvements bring us to first place in the competition, outperforming the second place by clear margins. Code and models will be released at https://github.com/tiny-smart/gwfss25.
Chinese: 针对MLCAS 2025小麦分割挑战赛,我们通过动态上采样、半监督蒸馏和测试时缩放三项关键技术重点优化了茎秆的精细结构与类别不平衡问题,最终以显著优势获得竞赛第一名。
English: Our solution for the MLCAS 2025 wheat segmentation challenge focuses on addressing the fine structure and class imbalance of stems through three key improvements—dynamic upsampling, semi-supervised distillation, and test-time scaling—securing first place with significant performance gains.

Authors:Zhihao Chen, Qi Gao, Zilong Li, Junping Zhang, Yi Zhang, Jun Zhao, Hongming Shan
Title: FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising
Abstract:
Low-dose computed tomography (CT) denoising is crucial for reduced radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity during varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DA-CLIP into the diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate superior denoising performance of FoundDiff over existing state-of-the-art methods and the remarkable generalization to unseen dose levels. The codes and models are available at https://github.com/hao1635/FoundDiff.
中文: FoundDiff是一种基础扩散模型,通过剂量-解剖感知和自适应去噪的两阶段策略,实现了跨不同剂量水平和解剖区域的可推广低剂量CT去噪,性能优于现有方法。
English: FoundDiff is a foundational diffusion model that uses a two-stage approach with dose-anatomy perception and adaptive denoising to achieve generalizable low-dose CT denoising across various dose levels and anatomical regions, outperforming existing methods.

Authors:Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi
Title: Explain Before You Answer: A Survey on Compositional Visual Reasoning
Abstract:
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
中文: 本综述系统整合了2023-2025年组合视觉推理研究,通过分析范式演变、评测基准与核心挑战,为推进多模态AI发展提出了世界模型集成等未来方向。
English: This survey comprehensively synthesizes compositional visual reasoning research from 2023-2025, analyzing paradigm shifts, benchmarks, and challenges while proposing future directions like world-model integration to advance multimodal AI.

Authors:Breenda Das, Lennart Purucker, Timur Carstensen, Frank Hutter
Title: Quickly Tuning Foundation Models for Image Segmentation
Abstract:
Foundation models like SAM (Segment Anything Model) exhibit strong zero-shot image segmentation performance, but often fall short on domain-specific tasks. Fine-tuning these models typically requires significant manual effort and domain expertise. In this work, we introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG predicts high-performing configurations using meta-learned cost and performance models, efficiently navigating a search space of over 200 million possibilities. We evaluate QTT-SEG on eight binary and five multiclass segmentation datasets under tight time constraints. Our results show that QTT-SEG consistently improves upon SAM's zero-shot performance and surpasses AutoGluon Multimodal, a strong AutoML baseline, on most binary tasks within three minutes. On multiclass datasets, QTT-SEG delivers consistent gains as well. These findings highlight the promise of meta-learning in automating model adaptation for specialized segmentation tasks. Code available at: https://github.com/ds-brx/QTT-SEG/
中文: QTT-SEG是一种基于元学习的方法,可自动优化图像分割模型SAM的微调过程,在严格时间限制下显著提升了其在专业任务上的零样本性能表现。
English: QTT-SEG is a meta-learning approach that automates fine-tuning of the Segment Anything Model for image segmentation, significantly improving zero-shot performance on domain-specific tasks within tight time constraints.
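A minimal sketch of the selection loop implied by the abstract: score candidate fine-tuning configurations with meta-learned cost and performance predictors and keep the best one that fits the time budget. The perf_model and cost_model callables are placeholders, not the Quick-Tune API.

    def select_config(candidates, perf_model, cost_model, time_budget_s):
        """Return the candidate with the highest predicted performance within the budget."""
        best, best_score = None, float("-inf")
        for cfg in candidates:
            if cost_model(cfg) > time_budget_s:   # predicted fine-tuning time in seconds
                continue
            score = perf_model(cfg)               # predicted validation performance
            if score > best_score:
                best, best_score = cfg, score
        return best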

Authors:Xiaoyang Hao, Han Li
Title: PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
Abstract:
Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.
中文: 本文提出PersPose框架,通过引入透视编码来整合相机参数和透视旋转来居中人体,有效减少透视畸变并提升模型拟合能力,在多个数据集上实现了最优的三维人体姿态估计性能。
English: This paper introduces PersPose, a novel monocular 3D human pose estimation framework that incorporates Perspective Encoding to encode camera intrinsics and Perspective Rotation to center human subjects, achieving state-of-the-art performance by reducing perspective distortions and improving model fitting.
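The centering step can be pictured with a small geometric sketch: given camera intrinsics and the pixel at which the subject appears, build the rotation that maps the viewing ray through that pixel onto the principal axis, so the re-rendered crop has the subject at the image center. The inputs here are illustrative; this is not the released PersPose code.

    import numpy as np

    def centering_rotation(K, subject_uv):
        """K: (3,3) camera intrinsics; subject_uv: (u, v) pixel of the subject center."""
        u, v = subject_uv
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
        ray /= np.linalg.norm(ray)                    # viewing ray toward the subject
        z = np.array([0.0, 0.0, 1.0])                 # camera principal axis
        axis = np.cross(ray, z)
        s, c = np.linalg.norm(axis), float(ray @ z)   # sin and cos of the rotation angle
        if s < 1e-8:
            return np.eye(3)                          # subject already centered
        axis /= s
        A = np.array([[0, -axis[2], axis[1]],
                      [axis[2], 0, -axis[0]],
                      [-axis[1], axis[0], 0]])
        # Rodrigues' formula: the returned R maps `ray` onto `z`.
        return np.eye(3) + s * A + (1 - c) * (A @ A)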

Authors:Qibin Zhang, Xinyu Hao, Qiao Chen, Rui Xu, Fengyu Cong, Cheng Lu, Hongming Xu
Title: Multi-modal Knowledge Decomposition based Online Distillation for Biomarker Prediction in Breast Cancer Histopathology
Abstract:
Immunohistochemical (IHC) biomarker prediction benefits from multi-modal data fusion analysis. However, the simultaneous acquisition of multi-modal data, such as genomic and pathological information, is often challenging due to cost or technical limitations. To address this challenge, we propose an online distillation approach based on Multi-modal Knowledge Decomposition (MKD) to enhance IHC biomarker prediction in haematoxylin and eosin (H&E) stained histopathology images. This method leverages paired genomic-pathology data during training while enabling inference using either pathology slides alone or both modalities. Two teacher and one student models are developed to extract modality-specific and modality-general features by minimizing the MKD loss. To maintain the internal structural relationships between samples, Similarity-preserving Knowledge Distillation (SKD) is applied. Additionally, Collaborative Learning for Online Distillation (CLOD) facilitates mutual learning between teacher and student models, encouraging diverse and complementary learning dynamics. Experiments on the TCGA-BRCA and in-house QHSU datasets demonstrate that our approach achieves superior performance in IHC biomarker prediction using uni-modal data. Our code is available at https://github.com/qiyuanzz/MICCAI2025_MKD.
中文: 本研究提出了一种基于多模态知识分解的在线蒸馏方法,通过联合训练基因组和病理学数据,实现在仅使用病理学图像时也能有效预测IHC生物标志物,并在TCGA-BRCA和QHSU数据集上验证了其优越性能。
English: The study introduces an online distillation method using Multi-modal Knowledge Decomposition to improve IHC biomarker prediction from H&E images by training with paired genomic-pathology data but allowing inference with pathology data alone, validated on TCGA-BRCA and QHSU datasets.
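One of the ingredients, similarity-preserving distillation, has a compact form: match the normalized batch-wise similarity (Gram) matrices of teacher and student features so that relationships between samples are preserved. The sketch below follows the generic similarity-preserving KD formulation with illustrative shapes; it is not the authors' released loss.

    import torch
    import torch.nn.functional as F

    def skd_loss(feat_teacher: torch.Tensor, feat_student: torch.Tensor) -> torch.Tensor:
        """feat_*: (B, D) pooled features; returns a scalar similarity-preserving loss."""
        g_t = F.normalize(feat_teacher @ feat_teacher.t(), p=2, dim=1)  # (B, B) teacher similarities
        g_s = F.normalize(feat_student @ feat_student.t(), p=2, dim=1)  # (B, B) student similarities
        return (g_t - g_s).pow(2).sum() / feat_teacher.shape[0] ** 2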

Authors:Hyeyeon Kim, Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Title: MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling
Abstract:
In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple images with their captions, and their summaries by excluding factually inconsistent instances. Our approach selects one image from the multiple images accompanying the documents. Using the gold summary, we independently rank both the images and their captions. Then, we annotate a pseudo-label for an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within texts. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text- and image-only pseudo-labeling methods, which consider captions and images separately. We release our code at: https://github.com/HyeyeeonKim/MMCIG
中文摘要:本研究提出了一种从纯文本文档生成封面图像及对应摘要的新任务,通过多模态伪标注方法低成本构建高质量数据集,该方法联合评估图像与标题,实验证明其比单模态方法能构建更精确数据集并生成更优质图像。
English Summary: This study introduces a novel task for generating cover images with corresponding summaries from text documents, proposing a multimodal pseudo-labeling method to create high-quality datasets efficiently by jointly evaluating images and captions, which outperforms unimodal approaches.
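The agreement rule at the core of the pseudo-labeling can be sketched in a few lines: rank the candidate images and their captions against the gold summary independently, and annotate a pseudo-label only when both rankings agree on the same item. The embedding inputs are assumptions standing in for any text/image encoders.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def pseudo_label(summary_emb, image_embs, caption_embs):
        """Return the index of the pseudo-labeled image, or None if the rankings disagree."""
        best_img = int(np.argmax([cosine(summary_emb, e) for e in image_embs]))
        best_cap = int(np.argmax([cosine(summary_emb, e) for e in caption_embs]))
        return best_img if best_img == best_cap else None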

Authors:Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Cuiqun Chen, Zhuo Zheng, Bo Du, Liangpei Zhang
Title: Advancing Weakly-Supervised Change Detection in Satellite Images via Adversarial Class Prompting
Abstract:
Weakly-Supervised Change Detection (WSCD) aims to distinguish specific object changes (e.g., objects appearing or disappearing) from background variations (e.g., environmental changes due to light, weather, or seasonal shifts) in paired satellite images, relying only on paired image (i.e., image-level) classification labels. This technique significantly reduces the need for dense annotations required in fully-supervised change detection. However, as image-level supervision only indicates whether objects have changed in a scene, WSCD methods often misclassify background variations as object changes, especially in complex remote-sensing scenarios. In this work, we propose an Adversarial Class Prompting (AdvCP) method to address this co-occurring noise problem, including two phases: a) Adversarial Prompt Mining: After each training iteration, we introduce adversarial prompting perturbations, using incorrect one-hot image-level labels to activate erroneous feature mappings. This process reveals co-occurring adversarial samples under weak supervision, namely background variation features that are likely to be misclassified as object changes. b) Adversarial Sample Rectification: We integrate these adversarially prompt-activated pixel samples into training by constructing an online global prototype. This prototype is built from an exponentially weighted moving average of the current batch and all historical training data. Our AdvCP can be seamlessly integrated into current WSCD methods without adding additional inference cost. Experiments on ConvNet, Transformer, and Segment Anything Model (SAM)-based baselines demonstrate significant performance enhancements. Furthermore, we demonstrate the generalizability of AdvCP to other multi-class weakly-supervised dense prediction scenarios. Code is available at https://github.com/zhenghuizhao/AdvCP
中文: 提出的对抗类别提示方法通过标签扰动挖掘对抗样本并利用全局原型进行校正,有效解决了弱监督变化检测中的共现噪声问题,在不增加推理成本的情况下显著提升了多种模型的性能。
English: The proposed Adversarial Class Prompting (AdvCP) method tackles the co-occurring noise problem in Weakly-Supervised Change Detection by mining adversarial samples through label perturbations and rectifying them via a global prototype, significantly improving performance across various models without extra inference costs.
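The rectification phase relies on an online global prototype maintained with an exponentially weighted moving average over the current batch and history; a minimal sketch of such an update is below, with the momentum value and feature shapes as assumptions.

    import torch

    class OnlinePrototype:
        def __init__(self, dim: int, momentum: float = 0.99):
            self.proto = torch.zeros(dim)
            self.momentum = momentum
            self.initialized = False

        @torch.no_grad()
        def update(self, batch_feats: torch.Tensor) -> torch.Tensor:
            """batch_feats: (N, dim) features of the adversarially activated pixels in this batch."""
            batch_mean = batch_feats.mean(dim=0)
            if not self.initialized:
                self.proto, self.initialized = batch_mean.clone(), True
            else:
                self.proto = self.momentum * self.proto + (1 - self.momentum) * batch_mean
            return self.proto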

Authors:Yajat Yadav, Varun Bharadwaj, Jathin Korrapati, Tanish Baranwal
Title: VROOM - Visual Reconstruction over Onboard Multiview
Abstract:
We introduce VROOM, a system for reconstructing 3D models of Formula 1 circuits using only onboard camera footage from racecars. Leveraging video data from the 2023 Monaco Grand Prix, we address video challenges such as high-speed motion and sharp cuts in camera frames. Our pipeline analyzes different methods such as DROID-SLAM, AnyCam, and Monst3r and combines preprocessing techniques such as different methods of masking, temporal chunking, and resolution scaling to account for dynamic motion and computational constraints. We show that VROOM is able to partially recover track and vehicle trajectories in complex environments. These findings indicate the feasibility of using onboard video for scalable 4D reconstruction in real-world settings. The project page can be found at https://varun-bharadwaj.github.io/vroom, and our code is available at https://github.com/yajatyadav/vroom.
中文:VROOM系统利用车载摄像头视频重建F1赛道三维模型,通过处理高速运动和动态环境等挑战,验证了在真实场景中实现可扩展4D重建的可行性。
English: VROOM reconstructs 3D models of Formula 1 circuits using onboard camera footage, overcoming challenges like high-speed motion and demonstrating the feasibility of scalable 4D reconstruction in real-world environments.

Authors:Stefanos Pasios, Nikos Nikolaidis
Title: REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework
Abstract:
Photorealism is an important aspect of modern video games since it can shape the player experience and simultaneously impact the immersion, narrative engagement, and visual fidelity. Although recent hardware technological breakthroughs, along with state-of-the-art rendering technologies, have significantly improved the visual realism of video games, achieving true photorealism in dynamic environments at real-time frame rates still remains a major challenge due to the tradeoff between visual quality and performance. In this short paper, we present a novel approach for enhancing the photorealism of rendered game frames using generative adversarial networks. To this end, we propose Real-time photorealism Enhancement in Games via a dual-stage gEnerative Network framework (REGEN), which employs a robust unpaired image-to-image translation model to produce semantically consistent photorealistic frames that transform the problem into a simpler paired image-to-image translation task. This enables training with a lightweight method that can achieve real-time inference time without compromising visual quality. We demonstrate the effectiveness of our framework on Grand Theft Auto V, showing that the approach achieves visual results comparable to the ones produced by the robust unpaired Im2Im method while improving inference speed by 32.14 times. Our findings also indicate that the results outperform the photorealism-enhanced frames produced by directly training a lightweight unpaired Im2Im translation method to translate the video game frames towards the visual characteristics of real-world images. Code, pre-trained models, and demos for this work are available at: https://github.com/stefanos50/REGEN.
中文摘要:本文提出REGEN双阶段生成网络框架,通过将非配对图像转换转化为更简单的配对任务,在保持与稳健方法相当视觉质量的同时,以32倍提速实现游戏画面的实时照片级真实感增强。
English Summary: The paper introduces REGEN, a dual-stage generative network that enhances video game photorealism in real-time by converting unpaired image translation into a simpler paired task, achieving a 32x speed improvement while maintaining visual quality comparable to robust methods.

Authors:Qingwen Zhang, Xiaomeng Zhu, Yushan Zhang, Yixi Cai, Olov Andersson, Patric Jensfelt
Title: DeltaFlow: An Efficient Multi-frame Scene Flow Estimation Method
Abstract:
Previous dominant methods for scene flow estimation focus mainly on input from two consecutive frames, neglecting valuable information in the temporal domain. While recent trends shift towards multi-frame reasoning, they suffer from rapidly escalating computational costs as the number of frames grows. To leverage temporal information more efficiently, we propose DeltaFlow (ΔFlow), a lightweight 3D framework that captures motion cues via a Δ scheme, extracting temporal features with minimal computational cost, regardless of the number of frames. Additionally, scene flow estimation faces challenges such as imbalanced object class distributions and motion inconsistency. To tackle these issues, we introduce a Category-Balanced Loss to enhance learning across underrepresented classes and an Instance Consistency Loss to enforce coherent object motion, improving flow accuracy. Extensive evaluations on the Argoverse 2 and Waymo datasets show that ΔFlow achieves state-of-the-art performance with up to 22% lower error and 2× faster inference compared to the next-best multi-frame supervised method, while also demonstrating a strong cross-domain generalization ability. The code is open-sourced at https://github.com/Kin-Zhang/DeltaFlow along with trained model weights.
中文: DeltaFlow提出了一种轻量级三维框架,通过增量方案高效提取时序特征,并采用双重损失函数解决类别不平衡和运动不一致问题,在Argoverse 2和Waymo数据集上实现最优性能——误差降低22%且推理速度提升两倍。
English: DeltaFlow introduces a lightweight 3D framework with a delta scheme for efficient temporal feature extraction and dual loss functions to address class imbalance and motion inconsistency, achieving state-of-the-art performance with 22% lower error and twice the speed of leading multi-frame methods.
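The Δ scheme can be illustrated with a toy feature-level sketch: motion cues come from cheap frame-to-frame feature differences, so extra frames add only subtractions rather than heavy per-frame processing. Shapes and the aggregation rule are illustrative, not the released ΔFlow design.

    import torch

    def delta_motion_cue(frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (T, C, H, W) per-frame features -> (2C, H, W) appearance + motion cue."""
        deltas = frame_feats[1:] - frame_feats[:-1]          # (T-1, C, H, W) per-step differences
        motion = deltas.mean(dim=0)                          # aggregate motion evidence over time
        return torch.cat([frame_feats[-1], motion], dim=0)   # latest appearance plus motion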

Authors:Xianjing Cheng, Lintai Wu, Zuowen Wang, Junhui Hou, Jie Wen, Yong Xu
Title: PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models
Abstract:
Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at https://github.com/chengxianjing/PVNet.
中文: PVNet提出了一种基于扩散模型的点体素交互框架,用于室外场景的激光雷达点云上采样,无需密集监督即可实现最优性能,并支持任意上采样率。
English: PVNet introduces a diffusion-based point-voxel interaction framework for LiDAR point cloud upsampling in outdoor scenes, achieving state-of-the-art performance without dense supervision and supporting arbitrary upsampling rates.

Authors:Dmitry Yudin
Title: M3DMap: Object-aware Multimodal 3D Mapping for Dynamic Environments
Abstract:
3D mapping in dynamic environments poses a challenge for modern researchers in robotics and autonomous transportation. There are no universal representations for dynamic 3D scenes that incorporate multimodal data such as images, point clouds, and text. This article takes a step toward solving this problem. It proposes a taxonomy of methods for constructing multimodal 3D maps, classifying contemporary approaches based on scene types and representations, learning methods, and practical applications. Using this taxonomy, a brief structured analysis of recent methods is provided. The article also describes an original modular method called M3DMap, designed for object-aware construction of multimodal 3D maps for both static and dynamic scenes. It consists of several interconnected components: a neural multimodal object segmentation and tracking module; an odometry estimation module, including trainable algorithms; a module for 3D map construction and updating with various implementations depending on the desired scene representation; and a multimodal data retrieval module. The article highlights original implementations of these modules and their advantages in solving various practical tasks, from 3D object grounding to mobile manipulation. Additionally, it presents theoretical propositions demonstrating the positive effect of using multimodal data and modern foundational models in 3D mapping methods. Details of the taxonomy and method implementation are available at https://yuddim.github.io/M3DMap.

Authors:Raghul Asokan
Title: F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search
Abstract:
The proliferation of digital food content has intensified the need for robust and accurate systems capable of fine-grained visual understanding and retrieval. In this work, we address the challenging task of food image-to-text matching, a critical component in applications such as dietary monitoring, smart kitchens, and restaurant automation. We propose F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search, a training-free, vision-language model (VLM)-guided framework that significantly improves retrieval performance through enhanced multi-modal feature representations. Our approach introduces two key contributions: (1) a uni-directional (and bi-directional) multi-modal fusion strategy that combines image embeddings with VLM-generated textual descriptions to improve query expressiveness, and (2) a novel feature-based re-ranking mechanism for top-k retrieval, leveraging predicted food ingredients to refine results and boost precision. Leveraging open-source image-text encoders, we demonstrate substantial gains over standard baselines - achieving ~10% and ~7.7% improvements in top-1 retrieval under dense and sparse caption scenarios, and a ~28.6% gain in top-k ingredient-level retrieval. Additionally, we show that smaller models (e.g., ViT-B/32) can match or outperform larger counterparts (e.g., ViT-H, ViT-G, ViT-bigG) when augmented with textual fusion, highlighting the effectiveness of our method in resource-constrained settings. Code and test datasets will be made publicly available at: https://github.com/mailcorahul/f4-its
中文: 本文提出F4-ITS框架,通过多模态融合和特征重排序技术提升食品图像文本匹配性能,在不同检索场景和模型规模下均实现了显著效果提升。
English: This paper introduces F4-ITS, a training-free framework that enhances food image-text matching through multi-modal fusion and feature re-ranking, achieving significant retrieval improvements across various scenarios and model sizes.
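A minimal sketch of the fusion-then-retrieval idea: combine the query image embedding with the embedding of its VLM-generated description, then rank candidate texts by cosine similarity. The weighting factor alpha and the encoders are assumptions, not the paper's exact configuration.

    import numpy as np

    def l2norm(x):
        return x / (np.linalg.norm(x) + 1e-8)

    def fused_query(image_emb, vlm_text_emb, alpha=0.5):
        # Weighted combination of normalized image and description embeddings.
        return l2norm(alpha * l2norm(image_emb) + (1.0 - alpha) * l2norm(vlm_text_emb))

    def retrieve_topk(query_emb, candidate_embs, k=5):
        """candidate_embs: (N, D) row-normalized text embeddings -> indices of the top-k matches."""
        sims = candidate_embs @ query_emb
        return np.argsort(-sims)[:k]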

Authors:Mingliang Li, Lin Yuanbo Wu, Changhong Liu, Hanxi Li
Title: A Novel Local Focusing Mechanism for Deepfake Detection Generalization
Abstract:
The rapid advancement of deepfake generation techniques has intensified the need for robust and generalizable detection methods. Existing approaches based on reconstruction learning typically leverage deep convolutional networks to extract differential features. However, these methods show poor generalization across object categories (e.g., from faces to cars) and generation domains (e.g., from GANs to Stable Diffusion), due to intrinsic limitations of deep CNNs. First, models trained on a specific category tend to overfit to semantic feature distributions, making them less transferable to other categories, especially as network depth increases. Second, Global Average Pooling (GAP) compresses critical local forgery cues into a single vector, thus discarding discriminative patterns vital for real-fake classification. To address these issues, we propose a novel Local Focus Mechanism (LFM) that explicitly attends to discriminative local features for differentiating fake from real images. LFM integrates a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module to select the K most informative local patterns. To mitigate potential overfitting introduced by Top-K pooling, we introduce two regularization techniques: Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), which enhance the model's robustness. LFM achieves a 3.7 improvement in accuracy and a 2.8 increase in average precision over the state-of-the-art Neighboring Pixel Relationships (NPR) method, while maintaining exceptional efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Our approach sets a new benchmark for cross-domain deepfake detection. The source code is available at https://github.com/lmlpy/LFM.git
中文: 提出的局部聚焦机制通过关注区分性局部特征,解决了现有深度伪造检测方法泛化能力不足的问题,在跨领域检测中实现了更优的准确性和效率。
English: The proposed Local Focus Mechanism (LFM) addresses the generalization limitations of existing deepfake detectors by focusing on discriminative local features, achieving superior accuracy and efficiency in cross-domain detection.
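The contrast with global average pooling can be made concrete with a small sketch of Top-K pooling: keep only the K strongest local responses per channel instead of averaging everything. Shapes and the selection rule are illustrative, not the released LFM modules (SNet, RBLD, and RKS are omitted).

    import torch

    def topk_pool(feat: torch.Tensor, k: int = 8) -> torch.Tensor:
        """feat: (B, C, H, W) feature map -> (B, C) descriptor from the K most salient locations."""
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)
        topk_vals, _ = flat.topk(k=min(k, h * w), dim=-1)  # K strongest local activations per channel
        return topk_vals.mean(dim=-1)                      # average only the selected local patterns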

Authors:Riad Hassan, M. Rubaiyat Hossain Mondal, Sheikh Iqbal Ahamed, Fahad Mostafa, Md Mostafijur Rahman
Title: An Efficient Dual-Line Decoder Network with Multi-Scale Convolutional Attention for Multi-organ Segmentation
Abstract:
Proper segmentation of organs-at-risk is important for radiation therapy, surgical planning, and diagnostic decision-making in medical image analysis. While deep learning-based segmentation architectures have made significant progress, they often fail to balance segmentation accuracy with computational efficiency. Most of the current state-of-the-art methods either prioritize performance at the cost of high computational complexity or compromise accuracy for efficiency. This paper addresses this gap by introducing an efficient dual-line decoder segmentation network (EDLDNet). The proposed method features a noisy decoder, which learns to incorporate structured perturbation at training time for better model robustness, yet at inference time only the noise-free decoder is executed, leading to lower computational cost. Multi-Scale convolutional Attention Modules (MSCAMs), Attention Gates (AGs), and Up-Convolution Blocks (UCBs) are further utilized to optimize feature representation and boost segmentation performance. By leveraging multi-scale segmentation masks from both decoders, we also utilize a mutation-based loss function to enhance the model's generalization. Our approach outperforms SOTA segmentation architectures on four publicly available medical imaging datasets. EDLDNet achieves SOTA performance with an 84.00% Dice score on the Synapse dataset, surpassing baseline models such as UNet by 13.89% in Dice score while significantly reducing Multiply-Accumulate Operations (MACs) by 89.7%. Compared to recent approaches like EMCAD, our EDLDNet not only achieves a higher Dice score but also maintains comparable computational efficiency. The outstanding performance across diverse datasets establishes EDLDNet's strong generalization, computational efficiency, and robustness. The source code, pre-processed data, and pre-trained weights will be available at https://github.com/riadhassan/EDLDNet.
Chinese: 本文提出的EDLDNet高效双线解码器分割网络,通过噪声解码器和多尺度注意力模块等创新设计,在多个医学影像数据集上实现了最佳性能,同时兼顾了分割精度与计算效率。
English: This paper introduces EDLDNet, an efficient dual-line decoder segmentation network that achieves state-of-the-art performance on medical imaging datasets by balancing high accuracy with computational efficiency through innovative components like a noisy decoder and multi-scale attention modules.

Authors:Yahao Liu, Qin Wang, Lixin Duan, Wen Li
Title: Balanced Sharpness-Aware Minimization for Imbalanced Regression
Abstract:
Regression is fundamental in computer vision and is widely used in various tasks including age estimation, depth estimation, target localization, etc. However, real-world data often exhibits imbalanced distribution, making regression models perform poorly especially for target values with rare observations (known as the imbalanced regression problem). In this paper, we reframe imbalanced regression as an imbalanced generalization problem. To tackle that, we look into the loss sharpness property for measuring the generalization ability of regression models in the observation space. Namely, given a certain perturbation on the model parameters, we check how model performance changes according to the loss values of different target observations. We propose a simple yet effective approach called Balanced Sharpness-Aware Minimization (BSAM) to enforce the uniform generalization ability of regression models for the entire observation space. In particular, we start from the traditional sharpness-aware minimization and then introduce a novel targeted reweighting strategy to homogenize the generalization ability across the observation space, which guarantees a theoretical generalization bound. Extensive experiments on multiple vision regression tasks, including age and depth estimation, demonstrate that our BSAM method consistently outperforms existing approaches. The code is available at https://github.com/manmanjun/BSAM_for_Imbalanced_Regression.
中文: 本文针对计算机视觉中的不平衡回归问题,提出平衡锐度感知最小化方法,通过目标重加权策略和锐度感知优化来统一模型在整个观测空间的泛化能力,在年龄和深度估计等任务中展现出优于现有方法的性能。
English: This paper addresses the imbalanced regression problem in computer vision by proposing Balanced Sharpness-Aware Minimization (BSAM), a method that enhances model generalization across all observation levels through targeted reweighting and sharpness-aware optimization, demonstrating superior performance in tasks like age and depth estimation.
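A minimal sketch of the two ingredients the abstract combines, sharpness-aware minimization plus per-sample reweighting, is given below. Here loss_fn is assumed to return per-sample losses (reduction='none'), and sample_weights is a placeholder for BSAM's targeted reweighting; this is a generic SAM-style update, not the authors' exact algorithm.

    import torch

    def weighted_sam_step(model, loss_fn, x, y, sample_weights, optimizer, rho=0.05):
        # 1) Ascent step: perturb parameters toward higher weighted loss (sharpness probe).
        (sample_weights * loss_fn(model(x), y)).mean().backward()
        with torch.no_grad():
            grads = [p.grad for p in model.parameters() if p.grad is not None]
            grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
            eps = {p: rho * p.grad / (grad_norm + 1e-12)
                   for p in model.parameters() if p.grad is not None}
            for p, e in eps.items():
                p.add_(e)
        optimizer.zero_grad()
        # 2) Descent step: gradient of the weighted loss at the perturbed point.
        loss = (sample_weights * loss_fn(model(x), y)).mean()
        loss.backward()
        with torch.no_grad():
            for p, e in eps.items():
                p.sub_(e)            # restore the original parameters
        optimizer.step()             # apply the sharpness-aware gradient
        optimizer.zero_grad()
        return loss.item()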

Authors:Tianhang Pan, Xiuyi Jia
Title: Local Information Matters: A Rethink of Crowd Counting
Abstract:
The motivation of this paper originates from rethinking an essential characteristic of crowd counting: individuals (heads of humans) in the crowd counting task typically occupy a very small portion of the image. This characteristic has never been the focus of existing works: they typically use the same backbone as other visual tasks and pursue a large receptive field. This drives us to propose a new model design principle of crowd counting: emphasizing local modeling capability of the model. We follow the principle and design a crowd counting model named Local Information Matters Model (LIMM). The main innovation lies in two strategies: a window partitioning design that applies grid windows to the model input, and a window-wise contrastive learning design to enhance the model's ability to distinguish between local density levels. Moreover, a global attention module is applied to the end of the model to handle the occasionally occurring large-sized individuals. Extensive experiments on multiple public datasets illustrate that the proposed model shows a significant improvement in local modeling capability (8.7% in MAE on the JHU-Crowd++ high-density subset for example), without compromising its ability to count large-sized ones, which achieves state-of-the-art performance. Code is available at: https://github.com/tianhangpan/LIMM.
中文摘要:本文提出用于人群计数的局部信息重要性模型(LIMM),通过窗口分区和对比学习增强局部建模能力,同时保持全局计数精度,实现了最先进的性能表现。
English Summary: This paper introduces the Local Information Matters Model (LIMM) for crowd counting, which enhances local modeling through window partitioning and contrastive learning while maintaining global counting accuracy, achieving state-of-the-art performance.
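The window partitioning design can be illustrated with a standard grid-window reshape that turns an image-level feature map into independent local windows for window-wise processing; the window size and shapes below are illustrative, not the LIMM configuration.

    import torch

    def window_partition(x: torch.Tensor, win: int = 16) -> torch.Tensor:
        """x: (B, C, H, W) with H and W divisible by `win` -> (B * nWindows, C, win, win)."""
        b, c, h, w = x.shape
        x = x.view(b, c, h // win, win, w // win, win)
        x = x.permute(0, 2, 4, 1, 3, 5).contiguous()   # group grid cells as independent windows
        return x.view(-1, c, win, win)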

Authors:Krishna Kanth Nakka, Alexandre Alahi
Title: NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability
Abstract:
The generation of transferable adversarial perturbations typically involves training a generator to maximize embedding separation between clean and adversarial images at a single mid-layer of a source model. In this work, we build on this approach and introduce Neuron Attack for Transferability (NAT), a method designed to target specific neurons within the embedding. Our approach is motivated by the observation that previous layer-level optimizations often disproportionately focus on a few neurons representing similar concepts, leaving other neurons within the attacked layer minimally affected. NAT shifts the focus from embedding-level separation to a more fundamental, neuron-specific approach. We find that targeting individual neurons effectively disrupts the core units of the neural network, providing a common basis for transferability across different models. Through extensive experiments on 41 diverse ImageNet models and 9 fine-grained models, NAT achieves fooling rates that surpass existing baselines by over 14% in cross-model and 4% in cross-domain settings. Furthermore, by leveraging the complementary attacking capabilities of the trained generators, we achieve impressive fooling rates within just 10 queries. Our code is available at: https://krishnakanthnakka.github.io/NAT/
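The neuron-level objective can be pictured with a toy loss: pick one channel of a mid-layer feature map and push its response on the adversarial image away from its response on the clean image. The shapes and the single-channel choice are illustrative; NAT's generator training adds further machinery.

    import torch

    def neuron_separation_loss(feat_clean: torch.Tensor, feat_adv: torch.Tensor,
                               neuron_idx: int) -> torch.Tensor:
        """feat_*: (B, C, H, W) mid-layer features; neuron_idx: channel (neuron) to attack."""
        a_clean = feat_clean[:, neuron_idx].flatten(1)   # (B, H*W) clean response of the neuron
        a_adv = feat_adv[:, neuron_idx].flatten(1)       # (B, H*W) adversarial response
        # Negative squared distance: minimizing it maximizes separation at this neuron.
        return -(a_adv - a_clean).pow(2).mean()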

Authors:Qi Song, Ziyuan Luo, Ka Chun Cheung, Simon See, Renjie Wan
Title: Align 3D Representation and Text Embedding for 3D Content Personalization
Abstract:
Recent advances in NeRF and 3DGS have significantly enhanced the efficiency and quality of 3D content synthesis. However, efficient personalization of generated 3D content remains a critical challenge. Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose Invert3D, a novel framework for convenient 3D content personalization. Nowadays, vision-language models such as CLIP enable direct image personalization through aligned vision-text embedding spaces. However, the inherent structural differences between 3D content and 2D images preclude direct application of these techniques to 3D personalization. Our approach bridges this gap by establishing alignment between 3D representations and text embedding spaces. Specifically, we develop a camera-conditioned 3D-to-text inverse mechanism that projects 3D contents into a 3D embedding aligned with text embeddings. This alignment enables efficient manipulation and personalization of 3D content through natural language prompts, eliminating the need for computationally expensive retraining procedures. Extensive experiments demonstrate that Invert3D achieves effective personalization of 3D content. Our work is available at: https://github.com/qsong2001/Invert3D.
中文:Invert3D提出了一种通过相机条件逆向机制将三维表征与文本嵌入对齐的新框架,无需昂贵重训练即可通过自然语言实现高效的三维内容个性化。
English: Invert3D introduces a novel framework that aligns 3D representations with text embeddings through a camera-conditioned inverse mechanism, enabling efficient 3D content personalization via natural language without costly retraining.

Authors:Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong
Title: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
Abstract:
Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.

Authors:Prerit Gupta, Jason Alexander Fotso-Puepi, Zhengyuan Li, Jay Mehta, Aniket Bera
Title: MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
Abstract:
We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and detailed with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where given music and a textual prompt, both the leader and follower dance motion are generated (2) Text-to-Dance Accompaniment, where given music, textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.

Authors:Shunyu Yao, Ming Liu, Zhilu Zhang, Zhaolin Wan, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo
Title: MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration
Abstract:
Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perceptions. However, most existing methods are obsessed with fitting the overall score, neglecting the fact that humans typically evaluate image quality from different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of our MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then amalgamated to generate the final IQA score. Additionally, when the MDIQA model is ready, we can deploy it for a flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that our MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available: https://github.com/YaoShunyu19/MDIQA.
中文: 提出的多维图像质量评估(MDIQA)框架通过技术和美学维度建模图像质量,更贴合人类视觉感知,并能灵活指导图像复原任务。
English: The proposed multi-dimensional image quality assessment (MDIQA) framework models image quality across technical and aesthetic dimensions to better align with human visual perception and can flexibly guide image restoration tasks.
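The preference-steering mechanism reduces to a weighted combination of per-dimension scores; a toy sketch is below, with the dimension names and weights as illustrative placeholders rather than the paper's nine dimensions.

    def weighted_iqa_score(dim_scores: dict, dim_weights: dict) -> float:
        """dim_scores, dim_weights: e.g. {'sharpness': 0.8, 'noise': 0.6, 'color': 0.7}."""
        total_w = sum(dim_weights.get(k, 0.0) for k in dim_scores)
        weighted = sum(s * dim_weights.get(k, 0.0) for k, s in dim_scores.items())
        return weighted / max(total_w, 1e-8)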

Authors:Xilai Li, Huichun Liu, Xiaosong Li, Tao Ye, Zhenyu Kuang, Huafeng Li
Title: AWM-Fuse: Multi-Modality Image Fusion for Adverse Weather via Global and Local Text Perception
Abstract:
Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although a few studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels, thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available at https://github.com/Feecuin/AWM-Fuse.
中文: AWM-Fuse是一种新颖的多模态图像融合方法,通过统一架构整合全局与局部文本感知来提升恶劣天气下的场景清晰度,在复杂天气条件和下游任务中优于现有技术。
English: AWM-Fuse is a novel multi-modality image fusion method that enhances scene clarity in adverse weather by integrating global and local text perception through a unified architecture, outperforming existing techniques in complex conditions and downstream tasks.

Authors:Xin Tian, Jiazheng Wang, Yuxi Zhang, Xiang Chen, Renjiu Hu, Gaolei Li, Min Liu, Hang Zhang
Title: Gaussian Primitive Optimized Deformable Retinal Image Registration
Abstract:
Deformable retinal image registration is notoriously difficult due to large homogeneous regions and sparse but critical vascular features, which cause limited gradient signals in standard learning-based frameworks. In this paper, we introduce Gaussian Primitive Optimization (GPO), a novel iterative framework that performs structured message passing to overcome these challenges. After an initial coarse alignment, we extract keypoints at salient anatomical structures (e.g., major vessels) to serve as a minimal set of descriptor-based control nodes (DCN). Each node is modelled as a Gaussian primitive with trainable position, displacement, and radius, thus adapting its spatial influence to local deformation scales. A K-Nearest Neighbors (KNN) Gaussian interpolation then blends and propagates displacement signals from these information-rich nodes to construct a globally coherent displacement field; focusing interpolation on the top-K neighbors reduces computational overhead while preserving local detail. By strategically anchoring nodes in high-gradient regions, GPO ensures robust gradient flow, mitigating vanishing gradient signal in textureless areas. The framework is optimized end-to-end via a multi-term loss that enforces both keypoint consistency and intensity alignment. Experiments on the FIRE dataset show that GPO reduces the target registration error from 6.2 px to ~2.4 px and increases the AUC at 25 px from 0.770 to 0.938, substantially outperforming existing methods. The source code can be accessed via https://github.com/xintian-99/GPOreg.
中文摘要:本文提出高斯基元优化(GPO)方法,通过在关键血管特征处部署可学习的高斯基元来解决视网膜图像配准中的梯度信号不足问题,在FIRE数据集上实现了显著优于现有方法的配准精度。
English Summary: The paper introduces Gaussian Primitive Optimization (GPO), a novel deformable retinal image registration framework that uses strategically placed Gaussian primitives at key vascular features to overcome gradient signal limitations, achieving state-of-the-art performance on the FIRE dataset.
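As a concrete illustration of the interpolation step described in the abstract, the sketch below blends displacements from the K nearest Gaussian control nodes into a dense field. It is a reading of the abstract, not the released GPOreg code; node positions, displacements, and radii are toy values.

```python
# Minimal sketch (not the authors' code): dense displacement field from a small
# set of Gaussian control nodes via K-nearest-neighbor Gaussian interpolation.
import numpy as np

def knn_gaussian_field(node_xy, node_disp, node_radius, grid_hw, k=4):
    """node_xy: (N,2) node positions; node_disp: (N,2) displacements;
    node_radius: (N,) Gaussian radii; grid_hw: (H,W) output field size."""
    H, W = grid_hw
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)  # (H*W, 2)

    # Pairwise squared distances between pixels and nodes: (H*W, N)
    d2 = ((pix[:, None, :] - node_xy[None, :, :]) ** 2).sum(-1)

    # Keep only the K nearest nodes per pixel to limit cost.
    knn_idx = np.argsort(d2, axis=1)[:, :k]                   # (H*W, K)
    knn_d2 = np.take_along_axis(d2, knn_idx, axis=1)          # (H*W, K)
    knn_sigma = node_radius[knn_idx]                          # (H*W, K)

    # Gaussian weights, normalized over the K neighbors.
    w = np.exp(-knn_d2 / (2.0 * knn_sigma ** 2 + 1e-8))
    w /= w.sum(axis=1, keepdims=True) + 1e-8

    # Blend node displacements into a dense field.
    field = (w[:, :, None] * node_disp[knn_idx]).sum(axis=1)  # (H*W, 2)
    return field.reshape(H, W, 2)

# Toy usage: three control nodes on a 64x64 grid.
nodes = np.array([[10., 10.], [50., 20.], [30., 55.]])
disp = np.array([[1., 0.], [0., 2.], [-1., -1.]])
radius = np.array([8., 12., 10.])
phi = knn_gaussian_field(nodes, disp, radius, (64, 64), k=3)
print(phi.shape)  # (64, 64, 2)
```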

Authors:Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari
Title: Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data
Abstract:
Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open source our code, trained models, and synthetic dataset at https://github.com/StanfordMIMI/RoentGen-v2 .
中文: RoentGen-v2 提出了一种文本到图像的扩散模型,用于生成具有人口统计学控制的临床可信胸部X光片,通过合成预训练显著提升了医学影像模型的准确性、泛化能力和公平性。
English: RoentGen-v2 introduces a text-to-image diffusion model for generating clinically plausible chest radiographs with demographic control, enabling synthetic pretraining that significantly improves model accuracy, generalization, and fairness in medical imaging.
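The abstract's training recipe (supervised pretraining on the balanced synthetic set, then fine-tuning on real data) can be sketched as below. The classifier, loaders, and hyperparameters are placeholders for illustration, not the released RoentGen-v2 pipeline.

```python
# Minimal sketch of synthetic pretraining followed by real-data fine-tuning;
# the model and data here are toy stand-ins, not the released code.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def run_stage(model, loader, epochs, lr, device="cpu"):
    """One supervised training stage; reused for both pretraining and fine-tuning."""
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()   # multi-label findings
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Toy stand-ins for the synthetic and real datasets (random tensors).
synthetic = DataLoader(TensorDataset(torch.randn(64, 256), torch.randint(0, 2, (64, 5)).float()), batch_size=16)
real = DataLoader(TensorDataset(torch.randn(32, 256), torch.randint(0, 2, (32, 5)).float()), batch_size=16)
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 5))

# Stage 1: supervised pretraining on the demographically balanced synthetic set.
model = run_stage(model, synthetic, epochs=2, lr=1e-3)
# Stage 2: fine-tuning on real data, typically with a smaller learning rate.
model = run_stage(model, real, epochs=1, lr=1e-4)
```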

Authors:Ashwath Vaithinathan Aravindan, Abha Jha, Mihir Kulkarni
Title: Do VLMs Have Bad Eyes? Diagnosing Compositional Failures via Mechanistic Interpretability
Abstract:
Vision-Language Models (VLMs) have shown remarkable performance in integrating visual and textual information for tasks such as image captioning and visual question answering. However, these models struggle with compositional generalization and object binding, which limit their ability to handle novel combinations of objects and their attributes. Our work explores the root causes of these failures using mechanistic interpretability techniques. We show evidence that individual neurons in the MLP layers of CLIP's vision encoder represent multiple features, and this "superposition" directly hinders its compositional feature representation, which consequently affects compositional reasoning and object binding capabilities. We hope this study will serve as an initial step toward uncovering the mechanistic roots of compositional failures in VLMs. The code and supporting results can be found at https://github.com/Mystic-Slice/Do-VLMs-Have-Bad-Eyes.
中文: 视觉语言模型因多层感知机神经元中的特征叠加问题,在组合泛化和对象绑定方面存在局限,本研究通过机制可解释性方法揭示了这些失败的根本原因。
English: Vision-Language Models face limitations in compositional generalization and object binding due to superposition in MLP neurons, which this study investigates using mechanistic interpretability to uncover the root causes of these failures.

Authors:Yosef Dayani, Omer Benishu, Sagie Benaim
Title: MV-RAG: Retrieval Augmented Multiview Diffusion
Abstract:
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.

Authors:Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li
Title: Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization
Abstract:
Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at https://github.com/helenypzhang/Disentangled-Multimodal-Learning.
中文: 本研究提出了一种解耦的多模态框架,通过将全切片图像和转录组分解为肿瘤与微环境子空间,采用置信度引导梯度协调和知识蒸馏等策略,解决了多模态异质性、多尺度整合及配对数据依赖等难题,在癌症诊断、预后和生存预测方面展现出卓越性能。
English: This study introduces a disentangled multi-modal framework that addresses challenges in multi-modal heterogeneity, multi-scale integration, and paired data dependency by decomposing whole slide images and transcriptomes into tumor and microenvironment subspaces, employing strategies like confidence-guided gradient coordination and knowledge distillation, ultimately demonstrating superior performance in cancer diagnosis, prognosis, and survival prediction.

Authors:Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
Title: Modular Embedding Recomposition for Incremental Learning
Abstract:
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
中文: 预训练视觉语言模型通过MoDER模块化框架重组专业文本专家,无需调整即可提升对未见类别的零样本分类能力,从而推进持续学习。
English: Pre-trained Vision-Language Models (VLMs) enhance Continual Learning by introducing MoDER, a modular framework that recomposes specialized textual experts to improve zero-shot classification on unseen classes without adaptation.
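A minimal sketch of the hub-and-compose idea as described in the abstract: retrieve the stored textual experts most similar to an unseen class's text embedding and blend them into a refined prototype. The similarity-weighted blending and the final combination with the zero-shot direction are assumptions, not the released MoDER code.

```python
# Minimal sketch (my reading of the abstract, not the released MoDER code):
# compose stored per-class textual experts into a prototype for an unseen class.
import torch
import torch.nn.functional as F

def compose_prototype(hub, zero_shot_text, k=5):
    """hub: (C_seen, D) expert embeddings; zero_shot_text: (D,) text embedding
    of the unseen class name; returns a refined (D,) prototype."""
    hub = F.normalize(hub, dim=-1)
    query = F.normalize(zero_shot_text, dim=-1)
    sims = hub @ query                          # (C_seen,)
    top_sim, top_idx = sims.topk(k)
    weights = top_sim.softmax(dim=0)            # similarity-weighted blend (assumption)
    proto = (weights[:, None] * hub[top_idx]).sum(0)
    # Keep the zero-shot direction and refine it with the retrieved experts.
    return F.normalize(query + proto, dim=-1)

# Toy usage with random embeddings.
hub = torch.randn(40, 512)       # 40 seen-class experts stored in the hub
unseen_text = torch.randn(512)   # text embedding of an unseen class
prototype = compose_prototype(hub, unseen_text, k=5)
image_feat = F.normalize(torch.randn(512), dim=-1)
print((image_feat @ prototype).item())  # cosine score used for classification
```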

Authors:Yong Zhang, Cunjian Chen, Qiang Gao, Yi Wang, Bin Fang
Title: A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection
Abstract:
Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.
中文: GMBINet是一种轻量级框架,通过创新的GMBI模块实现高效多尺度特征提取与交互,在保持高精度的同时显著降低计算成本,适用于钢铁制造中的实时表面缺陷检测。
English: GMBINet is a lightweight framework designed for real-time surface defect detection in steel manufacturing, featuring novel GMBI modules that enable efficient multiscale feature extraction and interaction while maintaining competitive accuracy with minimal computational overhead.

Authors:Fengshun Wang, Qiurui Wang, Peilin Zhao
Title: Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
Abstract:
Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, previous works regard video and audio cues as common features for both TES and PCS predictions without considering figure skating's actual judging criteria. Secondly, action elements in competitions are separated in time; TES should be derived from each element's score, yet existing methods give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating a visual-feature-based TES evaluation stream from an audio-visual-feature-based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, our proposed multi-scale Mamba pyramid and TES head effectively address the challenges of localizing and evaluating action elements at various temporal scales and produce score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
中文: 本研究提出了一种双流Mamba金字塔网络,分别通过视觉特征评估技术动作分和视听融合评估节目内容分,有效解决了花样滑冰评分中的动作定位和长视频处理难题,实现了最先进的性能。
English: This study introduces a two-stream Mamba pyramid network that separately evaluates Technical Element Scores (TES) using visual features and Program Component Scores (PCS) through audio-visual fusion, effectively addressing challenges in figure skating assessment by localizing action elements and handling long-range video dependencies with state-of-the-art results.

Authors:Yu Meng, Ligao Deng, Zhihao Xi, Jiansheng Chen, Jingbo Chen, Anzhi Yue, Diyou Liu, Kai Li, Chenhao Wang, Kaiyu Li, Yupeng Deng, Xian Sun
Title: IRSAMap: Towards Large-Scale, High-Resolution Land Cover Map Vectorization
Abstract:
With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap
中文: IRSAMap作为首个全球遥感矢量制图数据集,通过提供全面的矢量标注、智能工作流程、全球覆盖和多任务适应性,解决了现有数据类别有限、规模小及缺乏空间结构的问题,推动了从像素到对象的地理建模转型。
English: IRSAMap introduces the first global remote sensing dataset for large-scale, high-resolution land cover vector mapping, addressing challenges like limited annotations and spatial data by offering comprehensive vector annotations, intelligent workflows, global coverage, and multi-task adaptability to advance object-based geographic modeling.

Authors:Philipp D. Lösel, Aleese Barron, Yulai Zhang, Matthias Fabian, Benjamin Young, Nicolas Francois, Andrew M. Kingston
Title: Self-Validated Learning for Particle Separation: A Correctness-Based Self-Training Framework Without Human Labels
Abstract:
Non-destructive 3D imaging of large multi-particulate samples is essential for quantifying particle-level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor-intensive, error-prone, and difficult to scale. In this work, we propose self-validated learning, a novel self-training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self-validation mechanism mitigates the impact of noisy pseudo-labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state-of-the-art instance segmentation techniques. The method is integrated into the Biomedisa image analysis platform (https://github.com/biomedisa/biomedisa/).
中文: 本研究提出了一种自验证学习框架,用于3D成像中的颗粒实例分割,通过隐式边界检测和迭代自验证无需人工标注,在石英样本上实现了超过97%的体积分割精度。
English: This study introduces a self-validated learning framework for particle instance segmentation in 3D imaging, eliminating manual annotations by using implicit boundary detection and iterative self-validation to achieve over 97% volume segmentation accuracy on quartz samples.

Authors:Hohyun Na, Seunghoo Hong, Simon S. Woo
Title: PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting
Abstract:
The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a token shared across prompts that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
Chinese: PromptFlare是一种新颖的对抗性保护方法,通过利用交叉注意力机制注入噪声,有效阻止基于扩散模型的恶意图像修改,同时显著降低计算开销。
English: PromptFlare is a novel adversarial protection method that exploits the cross-attention mechanism to inject noise, effectively neutralizing malicious image modifications by diffusion models while reducing computational costs.

Authors:Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg
Title: T-MASK: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring
Abstract:
Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers ('probing'), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-MASK -- a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-MASK improves cross-view top-1 accuracy by $+1.23\%$ over strong probing baselines and $+8.0\%$ over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by $+5.42\%$ under the trained view and $+1.36\%$ under cross-view settings. This work provides encouraging evidence that adapting foundation models with lightweight probing methods like T-MASK has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at https://github.com/th-nesh/T-MASK to support ongoing research.
中文: 本研究提出T-MASK轻量级探测方法,通过时序令牌掩码技术增强跨视角驾驶员监控能力,在不增加参数的情况下显著超越了现有方法的识别准确率。
English: This study introduces T-MASK, a lightweight probing method that enhances cross-view driver monitoring by leveraging temporal token masking, achieving significant accuracy improvements over existing approaches without additional parameters.
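A minimal sketch of the temporal-masking idea, assuming tokens are scored by frame-to-frame feature change and the kept tokens are pooled for a linear probe; the paper's exact scoring and probing heads may differ.

```python
# Minimal sketch (assumed scoring; not the released T-MASK code): keep the most
# temporally dynamic patch tokens from a frozen image backbone before probing.
import torch

def temporal_token_mask(tokens, keep_ratio=0.5):
    """tokens: (T, N, D) per-frame patch tokens from a frozen encoder.
    Scores each spatial token by mean frame-to-frame change and keeps the top ones."""
    T, N, D = tokens.shape
    motion = (tokens[1:] - tokens[:-1]).norm(dim=-1).mean(dim=0)  # (N,)
    k = max(1, int(keep_ratio * N))
    keep = motion.topk(k).indices                                  # most dynamic tokens
    kept = tokens[:, keep, :]                                      # (T, k, D)
    return kept.mean(dim=(0, 1))                                   # pooled clip feature

# Toy usage: 16 frames, 196 patch tokens, 768-dim features, then a linear probe.
feats = torch.randn(16, 196, 768)
clip_feat = temporal_token_mask(feats, keep_ratio=0.25)
probe = torch.nn.Linear(768, 10)   # 10 driver activities (placeholder)
logits = probe(clip_feat)
print(logits.shape)                # torch.Size([10])
```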

Authors:Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Title: SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Abstract:
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
Chinese: SpecVLM是一种无需训练的推测解码框架,通过两阶段剪枝方法可去除高达90%的视频标记,在无损精度的情况下显著提升视频大语言模型的解码速度。
English: SpecVLM is a training-free speculative decoding framework that accelerates video large language models by pruning up to 90% of video tokens in two stages, achieving significant speed improvements without loss of accuracy.
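A minimal sketch of the staged pruning described in the abstract: keep the tokens with the highest verifier attention, then fill the rest of the budget with a spatially uniform sample. The split ratio and index handling are assumptions, not the released SpecVLM code.

```python
# Minimal sketch of two-stage video-token pruning (attention-guided selection,
# then spatially uniform sampling); details are assumptions, not released code.
import torch

def prune_video_tokens(video_tokens, verifier_attn, keep=0.1, informative_frac=0.6):
    """video_tokens: (N, D); verifier_attn: (N,) attention mass from the verifier.
    Returns indices of roughly keep*N kept tokens."""
    N = video_tokens.shape[0]
    budget = max(1, int(keep * N))
    n_info = max(1, int(informative_frac * budget))

    # Stage I: highest-attention tokens according to the verifier (target model).
    info_idx = verifier_attn.topk(n_info).indices

    # Stage II: spatially uniform sample over the remaining positions.
    mask = torch.ones(N, dtype=torch.bool)
    mask[info_idx] = False
    remaining = torch.nonzero(mask, as_tuple=False).squeeze(1)
    stride = max(1, remaining.numel() // (budget - n_info + 1))
    uniform_idx = remaining[::stride][: budget - n_info]

    return torch.cat([info_idx, uniform_idx]).sort().values

# Toy usage: 1,000 video tokens pruned to about 10%.
tok = torch.randn(1000, 64)
attn = torch.rand(1000)
kept = prune_video_tokens(tok, attn, keep=0.1)
print(kept.numel(), tok[kept].shape)
```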

Authors:Mohammad Mohammadzadeh Kalati, Farhad Maleki, Ian McQuillan
Title: FTIO: Frequent Temporally Integrated Objects
Abstract:
Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and leaves persistent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS--particularly when objects are small or structurally complex--by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO
中文摘要:提出的FTIO框架通过提取频繁出现的显著物体来优化目标选择,并采用三阶段掩码整合方法修正时序不一致性,从而在无监督视频目标分割任务中实现了最先进的性能。
English Summary: The proposed FTIO framework enhances unsupervised video object segmentation by improving object selection through frequently appearing salient objects and correcting temporal inconsistencies with a three-stage mask integration method, achieving state-of-the-art performance.
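A simplified sketch of the "frequently appearing salient object" criterion: link per-frame proposals into tracks by IoU and keep only tracks observed in enough frames. The greedy matching and thresholds are assumptions, not the released FTIO post-processing.

```python
# Minimal sketch (simplified greedy matching, not the released FTIO code) of
# selecting objects that appear frequently across frames.
import numpy as np

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def frequent_objects(frames, iou_thr=0.5, min_freq=0.6):
    """frames: list over time of lists of boolean masks (H, W).
    Returns tracks (lists of (frame_idx, mask)) of frequently seen objects."""
    tracks = []  # each track: list of (t, mask)
    for t, proposals in enumerate(frames):
        for mask in proposals:
            best, best_iou = None, iou_thr
            for track in tracks:
                score = iou(track[-1][1], mask)
                if score >= best_iou:
                    best, best_iou = track, score
            if best is not None:
                best.append((t, mask))
            else:
                tracks.append([(t, mask)])
    n_frames = len(frames)
    return [tr for tr in tracks if len(tr) / n_frames >= min_freq]

# Toy usage: one object present in all 5 frames, one spurious single-frame blob.
H = W = 32
obj = np.zeros((H, W), bool); obj[8:20, 8:20] = True
blob = np.zeros((H, W), bool); blob[25:30, 25:30] = True
frames = [[obj.copy()] for _ in range(5)]
frames[2].append(blob)
print(len(frequent_objects(frames)))  # 1 -> only the persistent object survives
```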

Authors:Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li
Title: Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation
Abstract:
Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5^i and a 9.7% improvement on COCO-20^i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
中文: 提出的TLG模型采用同源异构网络,通过专门模块增强语义互补性并减少噪声,在弱监督小样本语义分割任务中以极少的参数量实现了显著性能提升。
English: The proposed TLG model introduces a homologous but heterogeneous network with specialized modules to enhance semantic complementarity and reduce noise, achieving significant performance improvements in weakly-supervised few-shot semantic segmentation with minimal parameters.

Authors:Xiangde Luo, Xiyue Wang, Feyisope Eweje, Xiaoming Zhang, Sen Yang, Ryan Quinton, Jinxi Xiang, Yuchen Li, Yuanfeng Ji, Zhe Li, Yijiang Chen, Colin Bergstrom, Ted Kim, Francesca Maria Olguin, Kelley Yuan, Matthew Abikenari, Andrew Heider, Sierra Willens, Sanjeeth Rajaram, Robert West, Joel Neal, Maximilian Diehn, Ruijiang Li
Title: Ensemble learning of foundation models for precision oncology
Abstract:
Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF's slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies, cytotoxic chemotherapy, targeted therapy, and immunotherapy, across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.
中文: ELF框架通过集成学习整合五种病理学基础模型,生成统一的切片级表征,在精准肿瘤学的多种临床应用中都展现出卓越的准确性和鲁棒性。
English: The ELF framework integrates five pathology foundation models through ensemble learning to create unified slide-level representations, demonstrating superior accuracy and robustness across diverse clinical applications in precision oncology.

Authors:Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye
Title: Expandable Residual Approximation for Knowledge Distillation
Abstract:
Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher's head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.
Chinese Summary: 提出的可扩展残差近似(ERA)方法通过将残差知识分解为多个步骤并整合教师权重,显著提升了知识蒸馏效果,在ImageNet分类和MS COCO目标检测任务中实现了性能突破。
English Summary: The proposed Expandable Residual Approximation (ERA) method enhances knowledge distillation by decomposing residual knowledge into manageable steps and integrating teacher weights, achieving significant performance gains in ImageNet classification and MS COCO object detection.
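A minimal sketch of the residual-decomposition idea on generic feature vectors: each branch approximates whatever gap remains between the teacher feature and the cumulative student-side approximation. The branch architecture and loss form are assumptions, not the released ERA code.

```python
# Minimal sketch (generic features, assumed loss form) of expandable residual
# approximation: branches successively approximate the remaining teacher-student gap.
import torch
from torch import nn

class ResidualBranches(nn.Module):
    def __init__(self, dim, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(n_branches)]
        )

    def forward(self, student_feat, teacher_feat):
        approx = student_feat
        loss = 0.0
        for branch in self.branches:
            residual = (teacher_feat - approx).detach()  # target left to explain
            pred = branch(approx)
            loss = loss + torch.nn.functional.mse_loss(pred, residual)
            approx = approx + pred                       # cumulative approximation
        return approx, loss

# Toy usage: 8 samples with 256-dim student/teacher features.
mbr = ResidualBranches(dim=256, n_branches=3)
s, t = torch.randn(8, 256), torch.randn(8, 256)
approx, kd_loss = mbr(s, t)
kd_loss.backward()
print(approx.shape, float(kd_loss))
```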

Authors:Floris Erich, Naoya Chiba, Abdullah Mustafa, Ryo Hanai, Noriaki Ando, Yusuke Yoshiyasu, Yukiyasu Domae
Title: NeuralMeshing: Complete Object Mesh Extraction from Casual Captures
Abstract:
How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.
中文: 本文提出了一种自动化系统,通过处理多个视频、利用基准标记和运动恢复结构技术,无需商用扫描仪即可生成日常物品的完整三维网格模型。
English: This paper introduces an automated system that generates complete 3D models of everyday objects by processing multiple videos, using fiducial markers and Structure-from-Motion to create meshes without commercial scanners.

Authors:Hung-Jui Huang, Mohammad Amin Mirzaee, Michael Kaess, Wenzhen Yuan
Title: GelSLAM: A Real-time, High-Fidelity, and Robust 3D Tactile SLAM System
Abstract:
Accurately perceiving an object's pose and shape is essential for precise grasping and manipulation. Compared to common vision-based methods, tactile sensing offers advantages in precision and immunity to occlusion when tracking and reconstructing objects in contact. This makes it particularly valuable for in-hand and other high-precision manipulation tasks. In this work, we present GelSLAM, a real-time 3D SLAM system that relies solely on tactile sensing to estimate object pose over long periods and reconstruct object shapes with high fidelity. Unlike traditional point cloud-based approaches, GelSLAM uses tactile-derived surface normals and curvatures for robust tracking and loop closure. It can track object motion in real time with low error and minimal drift, and reconstruct shapes with submillimeter accuracy, even for low-texture objects such as wooden tools. GelSLAM extends tactile sensing beyond local contact to enable global, long-horizon spatial perception, and we believe it will serve as a foundation for many precise manipulation tasks involving interaction with objects in hand. The video demo is available on our website: https://joehjhuang.github.io/gelslam.

Authors:Zhaodong Jiang, Ashish Sinha, Tongtong Cao, Yuan Ren, Bingbing Liu, Binbin Xu
Title: UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation
Abstract:
Estimating the 6D pose of novel objects is a fundamental yet challenging problem in robotics, often relying on access to object CAD models. However, acquiring such models can be costly and impractical. Recent approaches aim to bypass this requirement by leveraging strong priors from foundation models to reconstruct objects from single or multi-view images, but typically require additional training or produce hallucinated geometry. To this end, we propose UnPose, a novel framework for zero-shot, model-free 6D object pose estimation and reconstruction that exploits 3D priors and uncertainty estimates from a pre-trained diffusion model. Specifically, starting from a single-view RGB-D frame, UnPose uses a multi-view diffusion model to estimate an initial 3D model using 3D Gaussian Splatting (3DGS) representation, along with pixel-wise epistemic uncertainty estimates. As additional observations become available, we incrementally refine the 3DGS model by fusing new views guided by the diffusion model's uncertainty, thereby continuously improving the pose estimation accuracy and 3D reconstruction quality. To ensure global consistency, the diffusion prior-generated views and subsequent observations are further integrated in a pose graph and jointly optimized into a coherent 3DGS field. Extensive experiments demonstrate that UnPose significantly outperforms existing approaches in both 6D pose estimation accuracy and 3D reconstruction quality. We further showcase its practical applicability in real-world robotic manipulation tasks.

Authors:Yijun Liu, Yuwei Liu, Yuan Meng, Jieheng Zhang, Yuwei Zhou, Ye Li, Jiacheng Jiang, Kangye Ji, Shijia Ge, Zhi Wang, Wenwu Zhu
Title: Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning
Abstract:
Vision-centric hierarchical embodied models have demonstrated strong potential for long-horizon robotic control. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through a spatial plan table. Then, we propose a spatial-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP significantly outperforms state-of-the-art baselines, achieving a 33.0% average improvement over the best baseline. With an 86.7% average success rate across 11 diverse tasks, SP substantially enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at https://plantpotatoonmoon.github.io/SpatialPolicy/.
中文摘要:提出的空间策略(SP)框架通过显式空间建模与推理增强机器人控制的空间感知能力,在11项任务中达到86.7%的平均成功率,性能较现有最佳方法提升33%。
English Summary: The proposed Spatial Policy (SP) framework enhances robotic control by integrating spatial awareness through explicit modeling and reasoning, achieving an 86.7% success rate and 33% performance improvement over existing methods.

Authors:Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, Ziwei Liu
Title: CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Abstract:
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns caused by accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.
中文: 视觉扩散模型因训练限制难以生成高分辨率内容,但提出的CineScale范式无需微调即可实现高达8K分辨率的高保真图像和视频生成,超越了现有方法并支持多种生成任务。
English: Visual diffusion models face challenges in generating high-resolution content due to training limitations, but the proposed CineScale paradigm enables high-fidelity image and video generation at up to 8k resolution without fine-tuning, expanding beyond existing methods to support various generation tasks.

Authors:Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, Jun-Yan Zhu
Title: Scaling Group Inference for Diverse and High-Quality Generation
Abstract:
Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
中文摘要:本文提出一种可扩展的群组推理方法,通过将样本选择构建为二次整数分配问题,在提升生成样本质量的同时显著增强群组多样性,有效解决了多样本输出中的冗余问题。
English Summary: This paper introduces a scalable group inference method that enhances both diversity and quality in generative model outputs by formulating sample selection as a quadratic integer assignment problem, effectively addressing redundancy in multi-sample presentations.
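A greedy sketch of the selection objective (a unary quality term plus pairwise diversity relative to already-chosen samples); the paper solves a quadratic integer assignment with progressive candidate pruning, which is not reproduced here.

```python
# Minimal sketch (greedy approximation, not the paper's exact solver) of picking
# a group that trades off per-sample quality against pairwise diversity.
import numpy as np

def select_group(quality, features, k=4, diversity_weight=1.0):
    """quality: (N,) per-candidate scores; features: (N, D) embeddings.
    Greedily builds a group of k candidates maximizing quality + diversity."""
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    chosen = [int(np.argmax(quality))]
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in range(len(quality)):
            if i in chosen:
                continue
            # Diversity: average cosine distance to already-selected samples.
            div = np.mean([1.0 - feats[i] @ feats[j] for j in chosen])
            score = quality[i] + diversity_weight * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# Toy usage: 32 candidates, pick a diverse high-quality group of 4.
rng = np.random.default_rng(0)
q = rng.random(32)
f = rng.normal(size=(32, 16))
print(select_group(q, f, k=4))
```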

Authors:Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
Title: Visual Autoregressive Modeling for Instruction-Guided Image Editing
Abstract:
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by a 30%+ higher GPT-Balance score. Moreover, it completes a $512\times512$ editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
中文: VAREdit提出了一种视觉自回归框架,将图像编辑重构为序列化的多尺度预测任务,通过尺度对齐的条件模块解决扩散模型的无关修改问题,在指令遵循性和效率上均实现显著提升。
English: VAREdit introduces a visual autoregressive framework that reframes image editing as sequential next-scale prediction, achieving superior adherence to instructions and efficiency by addressing the spurious modifications of diffusion models with a scale-aligned conditioning module.

Authors:Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie
Title: SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Abstract:
3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
中文: 本文提出SceneGen框架,通过单次前向传播从场景图像和物体掩码直接生成多个具有几何形状和纹理的3D资产,无需优化或检索过程,并能扩展至多图像输入场景,经评估证实其高效稳健的生成能力。
English: This paper introduces SceneGen, a novel framework that generates multiple 3D assets with geometry and texture directly from a single scene image and object masks in one feedforward pass, eliminating the need for optimization or retrieval while demonstrating extensibility to multi-image inputs and robust performance through evaluations.

Authors:Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, Rawal Khirodkar
Title: ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Abstract:
Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.

Authors:Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, Zehuan Yuan
Title: Waver: Wave Your Way to Lifelike Video Generation
Abstract:
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
中文: Waver是一个高性能的图像与视频生成基础模型,在单一框架内支持文生视频、图生视频和文生图任务,在多项评测中达到顶尖水平。
English: Waver is a high-performance foundation model for unified image and video generation that supports text-to-video, image-to-video, and text-to-image tasks within a single framework, achieving top-tier performance on leaderboards.

Authors:Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
Title: End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Abstract:
Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
中文摘要:Deep-DxSearch是一种基于强化学习的智能检索增强生成系统,通过提升外部知识利用和推理可追溯性来改进医疗诊断,在多种临床场景中显著超越现有模型的准确率表现。
English Summary: Deep-DxSearch is an agentic retrieval-augmented generation system trained with reinforcement learning that enhances medical diagnosis by improving knowledge utilization and reasoning traceability, outperforming existing models in accuracy across diverse clinical settings.
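A toy sketch of a composite reward over the four ingredients the abstract names (format, retrieval, reasoning structure, diagnostic accuracy); the field names, checks, and weights are placeholders, not the released reward implementation.

```python
# Minimal sketch (placeholder checks and weights, not the released reward code)
# of a composite RL reward over format, retrieval, reasoning, and accuracy.
def diagnosis_reward(episode, weights=(0.1, 0.2, 0.2, 0.5)):
    """episode: dict with the rollout's parsed fields; returns a scalar reward."""
    w_fmt, w_ret, w_struct, w_acc = weights
    fmt_ok = float(episode.get("well_formed", False))              # parseable output
    retrieved = episode.get("retrieved_docs", [])
    relevant = episode.get("relevant_docs", set())
    ret_score = (len([d for d in retrieved if d in relevant]) / len(retrieved)
                 if retrieved else 0.0)                             # retrieval precision
    struct_ok = float(episode.get("has_reasoning_trace", False))   # evidence-linked steps
    acc = float(episode.get("predicted_dx") == episode.get("gold_dx"))
    return w_fmt * fmt_ok + w_ret * ret_score + w_struct * struct_ok + w_acc * acc

# Toy usage.
ep = {"well_formed": True, "retrieved_docs": ["d1", "d2"], "relevant_docs": {"d1"},
      "has_reasoning_trace": True, "predicted_dx": "pneumonia", "gold_dx": "pneumonia"}
print(diagnosis_reward(ep))  # 0.9
```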

Authors:Ehsan Pajouheshgar, Aditya Bhardwaj, Nathaniel Selub, Ethan Lake
Title: Exploring the Landscape of Non-Equilibrium Memories with Neural Cellular Automata
Abstract:
We investigate the landscape of many-body memories: families of local non-equilibrium dynamics that retain information about their initial conditions for thermodynamically long time scales, even in the presence of arbitrary perturbations. In two dimensions, the only well-studied memory is Toom's rule. Using a combination of rigorous proofs and machine learning methods, we show that the landscape of 2D memories is in fact quite vast. We discover memories that correct errors in ways qualitatively distinct from Toom's rule, have ordered phases stabilized by fluctuations, and preserve information only in the presence of noise. Taken together, our results show that physical systems can perform robust information storage in many distinct ways, and demonstrate that the physics of many-body memories is richer than previously realized. Interactive visualizations of the dynamics studied in this work are available at https://memorynca.github.io/2D.
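For reference, the baseline 2D memory mentioned above, Toom's rule, updates each cell to the majority of itself and its north and east neighbors; the sketch below shows a stored bit surviving many noisy updates. Grid size and noise level are illustrative choices, not taken from the paper.

```python
# Minimal sketch of Toom's north-east-center majority rule, the well-studied 2D
# memory the abstract contrasts against; noise parameters are illustrative.
import numpy as np

def toom_step(grid, noise=0.02, rng=None):
    """grid: (H, W) array of 0/1 cells. Majority of cell, north, and east
    neighbors (periodic boundaries), followed by independent bit-flip noise."""
    north = np.roll(grid, 1, axis=0)
    east = np.roll(grid, -1, axis=1)
    new = ((grid + north + east) >= 2).astype(grid.dtype)  # majority of 3
    if rng is not None:
        flips = rng.random(grid.shape) < noise
        new = np.where(flips, 1 - new, new)
    return new

# Store an all-ones bit and check it survives many noisy updates.
rng = np.random.default_rng(0)
g = np.ones((64, 64), dtype=np.int64)
for _ in range(500):
    g = toom_step(g, noise=0.02, rng=rng)
print(g.mean())  # stays close to 1.0: the initial bit is remembered
```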

Authors:Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, Linjie Yang
Title: WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
Abstract:
Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.

Authors:Franz Hanke, Antonia Bieringer, Olaf Wysocki, Boris Jutzi
Title: CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps
Abstract:
Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail (LoD) 1 and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, we present CM2LoD3, a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available at: https://github.com/InFraHank/CM2LoD3
中文: CM2LoD3方法通过语义分割冲突图并与纹理模型数据融合,实现了自动化重建详细LoD3建筑模型,提升了分割和三维重建精度,为可扩展的城市三维建模开辟了新途径。
English: The CM2LoD3 method introduces an automated approach for reconstructing detailed LoD3 building models by semantically segmenting Conflict Maps and fusing them with textured model data, achieving improved segmentation and reconstruction accuracy for scalable 3D city modeling.

Authors:Ziyang Yan, Ruikai Li, Zhiyong Cui, Bohan Li, Han Jiang, Yilong Ren, Aoyong Li, Zhenning Li, Sijia Wen, Haiyang Yu
Title: MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction
Abstract:
Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird's eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.
中文:该研究提出MapKD知识蒸馏框架,将多模态地图知识迁移至纯视觉学生模型,在在线高精地图构建中实现了性能显著提升与推理加速。
English: The study introduces MapKD, a knowledge distillation framework that transfers multimodal map knowledge to a vision-only student model, achieving significant performance gains and faster inference speeds in online HD map construction.
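As a rough illustration of the two distillation ingredients named above (BEV feature alignment and masked response distillation), the sketch below combines a token-masked feature loss with a temperature-scaled KL term; the tensor shapes, the mask, and the weighting are hypothetical and do not reproduce the exact TGPD/MSRD designs.

```python
import torch
import torch.nn.functional as F

def distillation_losses(student_bev, teacher_bev, student_logits, teacher_logits,
                        token_mask, temperature=2.0):
    """Generic feature + response distillation in the spirit of TGPD / MSRD.

    student_bev, teacher_bev : (B, C, H, W) bird's-eye-view features
    student_logits, teacher_logits : (B, K, H, W) per-class map-element logits
    token_mask : (B, 1, H, W) binary mask selecting informative BEV tokens
    """
    # Feature alignment only where the mask selects tokens (patch-level guidance).
    feat_loss = (F.mse_loss(student_bev, teacher_bev, reduction="none")
                 * token_mask).sum() / token_mask.sum().clamp(min=1)

    # Soft-label response distillation on the semantic predictions.
    t = temperature
    resp_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits / t, dim=1),
        reduction="batchmean",
    ) * (t * t)
    return feat_loss, resp_loss
```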

Authors:Mengyu Wang, Zhenyu Liu, Kun Li, Yu Wang, Yuwei Wang, Yanyan Wei, Fei Wang
Title: Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion
Abstract:
Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks -- Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) -- demonstrate AdaSFFuse's superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.
中文:提出的AdaSFFuse框架通过自适应频率解耦和跨域融合解决多模态图像融合难题,在多种任务中实现卓越性能的同时保持计算效率。
English: The proposed AdaSFFuse framework addresses multimodal image fusion challenges through adaptive frequency decoupling and cross-domain fusion, achieving superior performance across multiple tasks while maintaining computational efficiency.

Authors:Chengqi Dong, Fenghe Tang, Rongge Mao, Xinpei Gao, S. Kevin Zhou
Title: LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
Abstract:
Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The whole project code is in https://github.com/cq-dong/LGMSNet.
中文: LGMSNet是一种新颖的轻量级医学图像分割框架,通过异构内核和变换器-卷积混合设计,在最小计算成本下实现卓越性能,并在多个数据集上展现出强大的泛化能力。
English: LGMSNet is a novel lightweight medical image segmentation framework that uses heterogeneous kernels and transformer-convolutional hybrids to achieve superior performance with minimal computational cost, demonstrating strong generalization across multiple datasets.

Authors:Huy Hoang Nguyen, Johannes Huemer, Markus Murschitz, Tobias Glueck, Minh Nhat Vu, Andreas Kugi
Title: Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation
Abstract:
The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as "pick up the steel beam pallet near the crane." The perception pipeline integrates Florence-2 and SAM-2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system's robustness and confirm its feasibility for deployment in operational logistics and construction environments. Video demonstrations are available at https://eric-nguyen1402.github.io/lang2lift.github.io/

Authors:Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, Xiaopeng Fan
Title: Spiking Variational Graph Representation Inference for Video Summarization
Abstract:
With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.
中文摘要:提出的脉冲变分图网络通过结合脉冲神经网络的关键帧提取、动态图推理和变分推断方法,有效提升了视频摘要的语义连贯性,同时降低了计算复杂度并减少了多通道特征融合中的噪声干扰。
English Summary: The proposed Spiking Variational Graph (SpiVG) Network addresses limitations in video summarization by combining spiking neural networks for keyframe extraction with dynamic graph reasoning and variational inference to enhance semantic coherence while reducing computational complexity and noise.
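The variational reconstruction module optimizes an evidence lower bound; a generic Gaussian-posterior version of such an objective, not the exact SpiVG formulation, might look like the following (shapes and the beta weight are illustrative).

```python
import torch
import torch.nn.functional as F

def elbo_loss(fused_feat, recon_feat, mu, logvar, beta=1.0):
    """Standard Gaussian ELBO used as a sketch of the reconstruction module.

    fused_feat : (B, D) noisy multi-channel features entering the module
    recon_feat : (B, D) features reconstructed from the latent sample
    mu, logvar : (B, Z) parameters of the approximate posterior q(z|x)
    """
    recon = F.mse_loss(recon_feat, fused_feat, reduction="mean")
    # KL(q(z|x) || N(0, I)) acts as the posterior-distribution regularizer.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```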

Authors:Olga Matykina, Dmitry Yudin
Title: RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features
Abstract:
Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
中文: RCDINO是一种基于多模态变换器的模型,通过融合DINOv2的丰富语义特征与视觉数据来增强三维物体检测能力,在nuScenes数据集上实现了最先进的性能。
English: RCDINO is a multimodal transformer-based model that enhances 3D object detection by integrating DINOv2's rich semantic features with visual data, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Wutao Liu, YiDan Wang, Pan Gao
Title: First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection
Abstract:
Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements. Code: https://github.com/Lwt-diamond/RAG-SEG.
中文: 提出的RAG-SEG方法通过检索增强生成创建提示词和SAM分割优化的两阶段设计,无需训练即可实现竞争性伪装物体检测性能,且能在个人笔记本电脑上高效运行。
English: The proposed RAG-SEG method addresses camouflaged object detection by combining retrieval-augmented generation for prompt creation and SAM-based segmentation for refinement, achieving competitive performance without training while operating efficiently on personal laptops.
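A minimal sketch of the retrieve-then-prompt idea, assuming patch features from any frozen encoder and a plain k-means retrieval database; the actual clustering, pseudo-label construction, and SAM2 prompting in RAG-SEG may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_database(ref_feats, ref_fg_ratio, n_clusters=256):
    """Cluster reference patch features into a compact retrieval database.

    ref_feats    : (N, D) L2-normalised patch features from a frozen encoder
    ref_fg_ratio : (N,)   fraction of foreground pixels in each reference patch
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(ref_feats)
    # Each centroid stores the mean foreground ratio of its members as a pseudo-label.
    proto_labels = np.array([ref_fg_ratio[km.labels_ == c].mean()
                             for c in range(n_clusters)])
    return km.cluster_centers_, proto_labels

def coarse_mask(query_feats, centers, proto_labels, grid_hw):
    """Retrieve the nearest prototype per query patch to form a coarse mask."""
    sims = query_feats @ centers.T              # cosine similarity (features normalised)
    idx = sims.argmax(axis=1)
    return proto_labels[idx].reshape(grid_hw)   # coarse mask, later used to prompt SAM2
```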

Authors:Jiamu Wang, Keunho Byeon, Jinsol Song, Anh Nguyen, Sangjeong Ahn, Sung Hak Lee, Jin Tae Kwak
Title: Pathology-Informed Latent Diffusion Model for Anomaly Detection in Lymph Node Metastasis
Abstract:
Anomaly detection is an emerging approach in digital pathology for its ability to efficiently and effectively utilize data for disease diagnosis. While supervised learning approaches deliver high accuracy, they rely on extensively annotated datasets, suffering from data scarcity in digital pathology. Unsupervised anomaly detection, however, offers a viable alternative by identifying deviations from normal tissue distributions without requiring exhaustive annotations. Recently, denoising diffusion probabilistic models have gained popularity in unsupervised anomaly detection, achieving promising performance in both natural and medical imaging datasets. Building on this, we incorporate a vision-language model with a diffusion model for unsupervised anomaly detection in digital pathology, utilizing histopathology prompts during reconstruction. Our approach employs a set of pathology-related keywords associated with normal tissues to guide the reconstruction process, facilitating the differentiation between normal and abnormal tissues. To evaluate the effectiveness of the proposed method, we conduct experiments on a gastric lymph node dataset from a local hospital and assess its generalization ability under domain shift using a public breast lymph node dataset. The experimental results highlight the potential of the proposed method for unsupervised anomaly detection across various organs in digital pathology. Code: https://github.com/QuIIL/AnoPILaD.
中文: 本研究提出一种结合视觉语言与扩散模型的无监督异常检测方法,通过组织病理学提示区分异常组织,并在多器官数据集中展现出优异的检测性能与泛化能力。
English: This study introduces an unsupervised anomaly detection method for digital pathology by integrating vision-language and diffusion models, using histopathology prompts to distinguish abnormal tissues and demonstrating strong performance across multiple organ datasets.

Authors:Shihao Dong, Xiaotong Zhou, Yuhui Zheng, Huiying Xu, Xinzhong Zhu
Title: Center-Oriented Prototype Contrastive Clustering
Abstract:
Contrastive learning is widely used in clustering tasks due to its discriminative representation. However, the conflict problem between classes is difficult to solve effectively. Existing methods try to solve this problem through prototype contrast, but there is a deviation between the calculation of hard prototypes and the true cluster center. To address this problem, we propose a center-oriented prototype contrastive clustering framework, which consists of a soft prototype contrastive module and a dual consistency learning module. In short, the soft prototype contrastive module uses the probability that the sample belongs to the cluster center as a weight to calculate the prototype of each category, while avoiding inter-class conflicts and reducing prototype drift. The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples respectively, ensuring that the features have transformation-invariant semantic information and compact intra-cluster distribution, while providing reliable guarantees for the calculation of prototypes. Extensive experiments on five datasets show that the proposed method is effective compared to the SOTA. Our code is published on https://github.com/LouisDong95/CPCC.
中文: 该研究提出的面向中心的原型对比聚类框架通过软原型加权和双重一致性学习解决类间冲突和原型漂移问题,在五个数据集上的实验证明了其优于现有方法的性能。
English: The proposed center-oriented prototype contrastive clustering framework addresses inter-class conflicts and prototype drift through soft prototype weighting and dual consistency learning, demonstrating superior performance in experiments across five datasets.
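The soft-prototype computation described above (cluster centers as probability-weighted means rather than hard averages) can be sketched as follows; the variable shapes are assumptions, and the full framework wraps the contrastive and dual-consistency losses around this step.

```python
import torch
import torch.nn.functional as F

def soft_prototypes(features, probs):
    """Probability-weighted ("soft") prototypes.

    features : (N, D) L2-normalised sample embeddings
    probs    : (N, K) cluster-assignment probabilities (rows sum to 1)

    Each prototype is the probability-weighted mean over all samples, which
    avoids the bias of hard prototypes computed from confident samples only.
    """
    weights = probs / probs.sum(dim=0, keepdim=True).clamp(min=1e-8)  # (N, K)
    protos = weights.t() @ features                                   # (K, D)
    return F.normalize(protos, dim=1)
```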

Authors:Hantao Zhang, Jingyang Liu, Ed Li
Title: See it. Say it. Sorted: Agentic System for Compositional Diagram Generation
Abstract:
We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative->aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at https://github.com/hantaoZhangrichard/see_it_say_it_sorted.git.
中文: 本研究提出了一种无需训练的智能系统,通过结合视觉语言模型与大语言模型的迭代优化,将手绘草图转化为精确可编辑的矢量图表,在布局还原度上超越现有模型,并具备程序化扩展能力。
English: This research introduces a training-free agentic system that combines Vision-Language and Large Language Models to convert hand sketches into precise, editable SVG diagrams through iterative refinement, outperforming existing models in layout accuracy while enabling programmatic extensibility.

Authors:Leiyue Zhao, Yuechen Yang, Yanfan Zhu, Haichun Yang, Yuankai Huo, Paul D. Simonson, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Title: DyMorph-B2I: Dynamic and Morphology-Guided Binary-to-Instance Segmentation for Renal Pathology
Abstract:
Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naïve combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph-B2I.
中文: DyMorph-B2I是一种动态的形态学引导二值到实例分割流程,通过整合分水岭、骨架化和形态学操作,有效分离肾脏病理中的粘连结构,显著提升实例分割精度和形态分析准确性。
English: DyMorph-B2I is a dynamic, morphology-guided pipeline that integrates watershed, skeletonization, and morphological operations to robustly convert binary masks into instance-level segmentations for renal pathology, outperforming classical methods and enabling more accurate morphometric analysis.
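To make the classical building blocks concrete, here is a minimal binary-to-instance split using scikit-image's morphological opening, distance transform, and marker-controlled watershed; the thresholds are the kind of per-class knobs the pipeline tunes, while DyMorph-B2I's skeletonization stage and adaptive geometric refinement are omitted from this sketch.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.morphology import binary_opening, disk
from skimage.segmentation import watershed

def binary_to_instances(binary_mask, min_distance=15, opening_radius=2):
    """Split a binary mask into instances via distance-transform watershed."""
    cleaned = binary_opening(binary_mask.astype(bool), footprint=disk(opening_radius))
    distance = ndi.distance_transform_edt(cleaned)
    peaks = peak_local_max(distance, min_distance=min_distance, labels=cleaned)
    markers = np.zeros_like(distance, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)   # one marker per peak
    return watershed(-distance, markers, mask=cleaned)        # labelled instance map
```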

Authors:Yan Luo, Drake Du, Hao Huang, Yi Fang, Mengyu Wang
Title: CurveFlow: Curvature-Guided Flow Matching for Image Generation
Abstract:
Existing rectified flow models are based on linear trajectories between data and noise distributions. This linearity enforces zero curvature, which can inadvertently force the image generation process through low-probability regions of the data manifold. A key question remains underexplored: how does the curvature of these trajectories correlate with the semantic alignment between generated images and their corresponding captions, i.e., instructional compliance? To address this, we introduce CurveFlow, a novel flow matching framework designed to learn smooth, non-linear trajectories by directly incorporating curvature guidance into the flow path. Our method features a robust curvature regularization technique that penalizes abrupt changes in the trajectory's intrinsic dynamics.Extensive experiments on MS COCO 2014 and 2017 demonstrate that CurveFlow achieves state-of-the-art performance in text-to-image generation, significantly outperforming both standard rectified flow variants and other non-linear baselines like Rectified Diffusion. The improvements are especially evident in semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. This confirms that our curvature-aware modeling substantially enhances the model's ability to faithfully follow complex instructions while simultaneously maintaining high image quality. The code is made publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/CurveFlow.
中文: CurveFlow提出了一种曲率引导的流匹配框架,通过非线性轨迹学习显著提升了文本到图像生成中的语义对齐和图像质量,优于线性模型。
English: CurveFlow introduces a curvature-guided flow matching framework that learns non-linear trajectories, significantly improving semantic alignment and image quality in text-to-image generation compared to linear models.
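One simple way to penalize abrupt changes in a trajectory's intrinsic dynamics is a finite-difference acceleration penalty on points sampled along the flow; the sketch below is illustrative only and is not claimed to be CurveFlow's exact regularizer.

```python
import torch

def curvature_penalty(trajectory):
    """Finite-difference second-derivative (acceleration) penalty.

    trajectory : (T, B, D) points x_t sampled along the flow from noise to data
    """
    velocity = trajectory[1:] - trajectory[:-1]      # (T-1, B, D)
    acceleration = velocity[1:] - velocity[:-1]      # (T-2, B, D)
    return acceleration.pow(2).mean()

# Added to the flow-matching objective with a small weight, e.g.
# loss = flow_matching_loss + 0.1 * curvature_penalty(traj)   # weight is illustrative
```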

Authors:Andrew C. Freeman, Luke Reinkensmeyer
Title: adder-viz: Real-Time Visualization Software for Transcoding Event Video
Abstract:
Recent years have brought about a surge in neuromorphic "event" video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADDER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.
中文: 本文介绍了对adder-viz软件的改进,用于实时可视化神经形态事件视频转码过程,通过统一的ADDER表示法解决了现有方法在灵活性和压缩性方面的局限。
English: This paper presents enhancements to the adder-viz software for real-time visualization of neuromorphic event video transcoding, addressing limitations in existing representations through the unified ADDER approach.

Authors:Andrei Balykin, Anvar Ganiev, Denis Kondranin, Kirill Polevoda, Nikolai Liudkevich, Artem Petrov
Title: Paired-Sampling Contrastive Framework for Joint Physical-Digital Face Attack Detection
Abstract:
Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
中文: 配对采样对比框架通过利用自动匹配的真实与攻击自拍对来学习模态无关的活体特征,实现了2.10%的低错误率,并具备轻量高效的特点,适合实际应用。
English: The Paired-Sampling Contrastive Framework is a unified training method that uses matched genuine and attack selfie pairs to learn modality-agnostic liveness cues, achieving a low error rate of 2.10% and high efficiency for real-world deployment.

Authors:Hakjin Lee, Junghoon Seo, Jaehoon Sim
Title: You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation
Abstract:
Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project page.
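A 6D-aware matching cost can be illustrated by augmenting the usual detection terms with geodesic rotation and translation distances before running the Hungarian algorithm; the terms and weights below are assumptions, not YOPO's published cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_objects(cls_prob, boxes_iou, R_pred, R_gt, t_pred, t_gt,
                             w=(2.0, 5.0, 1.0, 1.0)):
    """Hungarian matching with a pose-aware cost (weights are illustrative).

    cls_prob  : (Q, G) predicted probability of each GT object's class
    boxes_iou : (Q, G) 2D IoU between predicted and GT boxes
    R_pred, R_gt : (Q, 3, 3), (G, 3, 3) rotations; t_pred, t_gt : (Q, 3), (G, 3)
    """
    # Geodesic rotation distance between every query/GT pair: angle of R_pred @ R_gt^T.
    rel = np.einsum('qij,gkj->qgik', R_pred, R_gt)
    trace = np.clip((np.trace(rel, axis1=2, axis2=3) - 1) / 2, -1, 1)
    rot_cost = np.arccos(trace)                                   # (Q, G), radians
    trans_cost = np.linalg.norm(t_pred[:, None] - t_gt[None], axis=-1)

    cost = -w[0] * cls_prob - w[1] * boxes_iou + w[2] * rot_cost + w[3] * trans_cost
    return linear_sum_assignment(cost)                            # (query_idx, gt_idx)
```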

Authors:Chiao-An Yang, Raymond A. Yeh
Title: Heatmap Regression without Soft-Argmax for Facial Landmark Detection
Abstract:
Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. Our code is available here: https://github.com/ca-joe-yang/regression-without-softarg.
中文摘要:本研究通过引入基于结构化预测的训练目标,挑战了面部关键点检测中传统使用的Soft-argmax方法,在三个基准测试上以2.2倍更快的收敛速度实现了最优性能。
English Summary: This study challenges the conventional use of Soft-argmax in facial landmark detection by introducing a structured prediction-based training objective, which achieves state-of-the-art performance with 2.2x faster convergence on three benchmarks.
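For context, the Soft-argmax decoding that the paper revisits computes expected landmark coordinates under a softmax-normalised heatmap, as in the sketch below; the proposed structured-prediction objective replaces this step and is not shown here.

```python
import torch

def soft_argmax(heatmap):
    """Differentiable heatmap decoding: expected (x, y) per landmark.

    heatmap : (B, K, H, W), one map per landmark
    """
    b, k, h, w = heatmap.shape
    prob = torch.softmax(heatmap.view(b, k, -1), dim=-1).view(b, k, h, w)
    ys = torch.arange(h, dtype=prob.dtype, device=prob.device)
    xs = torch.arange(w, dtype=prob.dtype, device=prob.device)
    exp_y = (prob.sum(dim=3) * ys).sum(dim=2)   # marginal over rows, then expectation
    exp_x = (prob.sum(dim=2) * xs).sum(dim=2)   # marginal over columns, then expectation
    return torch.stack([exp_x, exp_y], dim=-1)  # (B, K, 2)
```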

Authors:Jia Lu, Taoran Yi, Jiemin Fang, Chen Yang, Chuiyun Wu, Wei Shen, Wenyu Liu, Qi Tian, Xinggang Wang
Title: Snap-Snap: Taking Two Images to Reconstruct 3D Human Gaussians in Milliseconds
Abstract:
Reconstructing 3D human bodies from sparse views has been an appealing topic, which is crucial to broadening the related applications. In this paper, we propose a quite challenging but valuable task to reconstruct the human body from only two images, i.e., the front and back view, which can largely lower the barrier for users to create their own 3D digital humans. The main challenges lie in the difficulty of building 3D consistency and recovering missing information from the highly sparse input. We redesign a geometry reconstruction model based on foundation reconstruction models to predict consistent point clouds even when input images have scarce overlaps, by training on extensive human data. Furthermore, an enhancement algorithm is applied to supplement the missing color information, and then the complete human point clouds with colors can be obtained, which are directly transformed into 3D Gaussians for better rendering quality. Experiments show that our method can reconstruct the entire human in 190 ms on a single NVIDIA RTX 4090, with two images at a resolution of 1024x1024, demonstrating state-of-the-art performance on the THuman2.0 and cross-domain datasets. Additionally, our method can complete human reconstruction even with images captured by low-cost mobile devices, reducing the requirements for data collection. Demos and code are available at https://hustvl.github.io/Snap-Snap/.
中文: 本文提出了一种仅需正反两张图像即可重建3D人体的方法,通过改进几何模型和色彩增强技术,在保证高质量渲染的同时大幅降低了数据采集和设备门槛。
English: This paper introduces a method for reconstructing 3D human bodies from just front and back view images, using a redesigned geometry model and color enhancement to achieve fast, high-quality results with minimal hardware requirements.

Authors:Licheng Shen, Saining Zhang, Honghan Li, Peilin Yang, Zihao Huang, Zongzheng Zhang, Hao Zhao
Title: GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects
Abstract:
Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2--3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.

Authors:Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian, Hao Xu, Minghan Qin, Xudong Xu, Bo Dai, Haoqian Wang, Zhaoyang Lyu, Jiangmiao Pang
Title: MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
Abstract:
Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding. The project homepage is available at https://daibingquan.github.io/MeshCoder.

Authors:Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Title: Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
Abstract:
We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
中文: Tinker是一个多功能3D编辑框架,无需逐场景训练即可通过最少输入图像实现高保真、多视角一致的编辑,利用预训练扩散模型及创新组件实现精确编辑和新视角生成。
English: Tinker is a versatile 3D editing framework that enables high-fidelity, multi-view consistent edits from minimal input images without per-scene training, leveraging pretrained diffusion models and novel components for precise editing and novel-view synthesis.

Authors:Abhijith Punnappurath, Luxi Zhao, Hoang Le, Abdelrahman Abdelhamed, SaiKiran Kumar Tedla, Michael S. Brown
Title: Improved Mapping Between Illuminations and Sensors for RAW Images
Abstract:
RAW images are unprocessed camera sensor output with sensor-specific RGB values based on the sensor's color filter spectral sensitivities. RAW images also incur strong color casts due to the sensor's response to the spectral properties of scene illumination. The sensor- and illumination-specific nature of RAW images makes it challenging to capture RAW datasets for deep learning methods, as scenes need to be captured for each sensor and under a wide range of illumination. Methods for illumination augmentation for a given sensor and the ability to map RAW images between sensors are important for reducing the burden of data capture. To explore this problem, we introduce the first-of-its-kind dataset comprising carefully captured scenes under a wide range of illumination. Specifically, we use a customized lightbox with tunable illumination spectra to capture several scenes with different cameras. Our illumination and sensor mapping dataset has 390 illuminations, four cameras, and 18 scenes. Using this dataset, we introduce a lightweight neural network approach for illumination and sensor mapping that outperforms competing methods. We demonstrate the utility of our approach on the downstream task of training a neural ISP. Link to project page: https://github.com/SamsungLabs/illum-sensor-mapping.
Chinese: RAW图像因传感器和光照特性导致的色彩偏差给深度学习带来挑战,本研究通过引入包含多种光照和传感器的创新数据集及轻量级神经网络方法,有效实现光照增强和传感器间映射,提升神经ISP训练效果。
English: RAW images present challenges for deep learning due to their sensor- and illumination-specific color casts, which this study addresses by introducing a novel dataset and a lightweight neural network for illumination and sensor mapping to facilitate neural ISP training.

Authors:Zichi Liu, Yinggui Wang, Tao Wei, Chao Ma
Title: AnchorSync: Global Consistency Optimization for Long Video Editing
Abstract:
Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.
中文摘要:AnchorSync是一种基于扩散的框架,通过将长视频编辑分解为关键帧编辑和中间帧插值,实现了高质量编辑,确保了结构一致性和时间连贯性,在视觉质量和时间稳定性上超越现有方法。
English Summary: AnchorSync is a diffusion-based framework that enhances long video editing by separating it into anchor frame editing and frame interpolation, ensuring structural consistency and temporal coherence for superior visual quality and stability.

Authors:Peiming Li, Ziyi Wang, Yulin Yuan, Hong Liu, Xiangming Meng, Junsong Yuan, Mengyuan Liu
Title: UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Abstract:
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To compensate for missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
中文摘要:本研究提出的统一时空状态空间模型(UST-SSM)通过语义感知的序列重组和时空特征增强,有效解决了点云视频在序列建模中的时空无序问题,在多个数据集上验证了其优越性能。
English Summary: The proposed Unified Spatio-Temporal State Space Model (UST-SSM) effectively processes point cloud videos by reorganizing unordered points into semantic sequences and enhancing spatio-temporal feature aggregation to overcome limitations in existing sequence modeling approaches.

Authors:Sofiène Boutaj, Marin Scalbert, Pierre Marza, Florent Couzinie-Devy, Maria Vakalopoulou, Stergios Christodoulidis
Title: Controllable Latent Space Augmentation for Digital Pathology
Abstract:
Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at https://github.com/MICS-Lab/HistAug.
中文: HistAug提出了一种用于数字病理学的高效可控潜在空间增强生成模型,通过生成保留语义的真实嵌入,在多样化数据集和低数据场景下持续提升多示例学习模型的性能。
English: HistAug introduces a generative model for efficient and controllable latent space augmentation in digital pathology, enhancing MIL model performance by generating realistic embeddings while preserving semantics across diverse datasets and low-data scenarios.

Authors:Walter Zimmer, Ross Greer, Xingcheng Zhou, Rui Song, Marc Pavel, Daniel Lehmberg, Ahmed Ghita, Akshay Gopalkrishnan, Mohan Trivedi, Alois Knoll
Title: Safety-Critical Learning for Long-Tail Events: The TUM Traffic Accident Dataset
Abstract:
Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as an unavoidable and sporadic outcome of traffic networks. We present the TUM Traffic Accident (TUMTraf-A) dataset, a collection of real-world highway accidents. It contains ten sequences of vehicle crashes at high-speed driving with 294,924 labeled 2D and 93,012 labeled 3D boxes and track IDs within 48,144 labeled frames recorded from four roadside cameras and LiDARs at 10 Hz. The dataset contains ten object classes and is provided in the OpenLABEL format. We propose Accid3nD, an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our project website: https://tum-traffic-dataset.github.io/tumtraf-a.
中文:尽管交通安全工作不断加强,事故仍难以避免,TUMTraf-A数据集收录了真实高速公路事故,结合规则与学习方法的Accid3nD模型在实验中展现出可靠的检测性能。
English: Despite extensive efforts to enhance transportation safety, accidents remain inevitable, and the TUMTraf-A dataset, featuring real-world highway crashes with extensive 2D/3D annotations, is introduced alongside the Accid3nD model, which effectively combines rule-based and learning-based approaches for robust accident detection.

Authors:Diego Belzarena, Seginus Mowlavi, Aitor Artola, Camilo Mariño, Marina Gardella, Ignacio Ramírez, Antoine Tadros, Roy He, Natalia Bottaioli, Boshra Rajaei, Gregory Randall, Jean-Michel Morel
Title: Improving OCR using internal document redundancy
Abstract:
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and to suggest better clustering. To this end, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
中文: 现有OCR系统在处理低质量数据时存在不足且未充分利用文档冗余性,为此我们提出一种无监督方法,通过利用字符形状冗余和扩展高斯混合模型来提升OCR精度和聚类效果,并在包括历史档案和报纸在内的退化文档上验证了其有效性。
English: Current OCR systems often struggle with low-quality data and fail to fully utilize document redundancy, so we propose an unsupervised method using character shape redundancy and an extended Gaussian Mixture Model to improve OCR accuracy and clustering, demonstrating effectiveness on degraded documents like historical archives and newspapers.
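A stripped-down version of the redundancy idea: cluster size-normalised character crops with an off-the-shelf Gaussian mixture, then relabel each glyph with its cluster's majority OCR label. The paper's intra-cluster realignment and normality-testing steps are not included, and in practice a dimensionality reduction would typically precede the GMM.

```python
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture

def redundancy_correct(char_crops, ocr_labels, n_components=80):
    """Exploit within-document glyph redundancy to correct OCR labels.

    char_crops : (N, H, W) size-normalised character images from one document
    ocr_labels : list of N labels emitted by the base OCR system
    """
    feats = char_crops.reshape(len(char_crops), -1).astype(np.float64)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(feats)
    clusters = gmm.predict(feats)
    corrected = list(ocr_labels)
    for c in np.unique(clusters):
        members = np.where(clusters == c)[0]
        # Repeated glyphs vote: the cluster's majority label overrides outliers.
        majority = Counter(ocr_labels[i] for i in members).most_common(1)[0][0]
        for i in members:
            corrected[i] = majority
    return corrected
```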

Authors:Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng, Ying Chen
Title: Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
Abstract:
We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with a cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.
中文:Vivid-VR是一种基于DiT的视频修复方法,通过概念蒸馏训练策略和改进的双分支控制架构,有效提升了纹理真实性和时序连贯性,在视觉质量上优于现有方法。
English: Vivid-VR is a DiT-based video restoration method that enhances texture realism and temporal coherence through concept distillation and an improved control architecture, outperforming existing approaches in visual quality.

Authors:Xiangfei Sheng, Xiaofeng Pan, Zhichao Yang, Pengfei Chen, Leida Li
Title: Fine-grained Image Quality Assessment for Perceptual Image Restoration
Abstract:
Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weaknesses for the IR task, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which reveals significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released at https://pxf0429.github.io/FGResQ/
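A plausible way to combine the two heads described above is a scalar regression loss plus a pairwise margin ranking loss on the fine-grained preferences, as sketched below; the loss composition, weights, and margin are assumptions rather than FGResQ's actual objective.

```python
import torch
import torch.nn.functional as F

def coarse_plus_fine_loss(pred_scores, gt_scores, pred_a, pred_b, preference,
                          margin=0.1, lam=1.0):
    """Joint coarse regression + fine-grained pairwise ranking (sketch).

    pred_scores, gt_scores : (B,) scalar quality predictions / quality labels
    pred_a, pred_b         : (P,) predicted scores for the two images of a pair
    preference             : (P,) +1 if image A is annotated as better, else -1
    """
    regression = F.mse_loss(pred_scores, gt_scores)
    ranking = F.margin_ranking_loss(pred_a, pred_b, preference, margin=margin)
    return regression + lam * ranking
```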

Authors:Gyusam Chang, Tuan-Anh Vu, Vivek Alumootil, Harris Song, Deanna Pham, Sangpil Kim, M. Khalid Jawed
Title: Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting
Abstract:
While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: https://github.com/StructuresComp/3D-Reconstruction-NIR
中文: 研究者提出了包含近红外图像和元数据的多模态农业数据集NIRPlant,并开发了NIRSplat跨注意力高斯溅射模型,在复杂农田场景中显著优于现有方法。
English: The authors introduce NIRPlant, a multimodal agricultural dataset with NIR imagery and metadata, and propose NIRSplat, a cross-attention Gaussian splatting model that outperforms existing methods in challenging farm environments.

Authors:Fei Peng, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Huiyuan Fu
Title: MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
Abstract:
Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
中文: 提出的MUSE框架通过引入拼接交叉注意力和渐进式训练策略,解决了布局可控多主体合成的难题,在图像生成中实现了卓越的空间精度和身份一致性。
English: The proposed MUSE framework addresses the challenge of layout-controllable multi-subject synthesis by introducing concatenated cross-attention and a progressive training strategy, achieving superior spatial accuracy and identity consistency in image generation.
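One way to read "concatenated cross-attention" is that layout tokens (one per referenced subject) are appended to the text tokens so that image queries attend over both in a single pass; the module below is a hedged sketch of that reading, with the layout encoder, dimensions, and inputs all hypothetical.

```python
import torch
import torch.nn as nn

class ConcatCrossAttention(nn.Module):
    """Sketch: image queries attend over text tokens and per-subject layout
    tokens jointly, i.e. over an expanded semantic space (illustrative only)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.layout_proj = nn.Linear(4 + dim, dim)  # box (x1,y1,x2,y2) + subject embedding

    def forward(self, image_tokens, text_tokens, subject_embs, boxes):
        # One layout token per referenced subject, appended to the text sequence.
        layout_tokens = self.layout_proj(torch.cat([boxes, subject_embs], dim=-1))
        context = torch.cat([text_tokens, layout_tokens], dim=1)   # (B, T+S, D)
        out, _ = self.attn(image_tokens, context, context)
        return out
```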

Authors:Jeahun Sung, Changhyun Roh, Chanho Eom, Jihyong Oh
Title: MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing
Abstract:
Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera's color filter array (CFA) and the display's sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.

Authors:Seokjun Choi, Hoon-Gyu Chung, Yujin Jeon, Giljoo Nam, Seung-Hwan Baek
Title: A Real-world Display Inverse Rendering Dataset
Abstract:
Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a diverse set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple, yet effective baseline for display inverse rendering, outperforming state-of-the-art inverse rendering methods. Code and dataset are available on our project page at https://michaelcsj.github.io/DIR/
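Synthesising captures under arbitrary display patterns follows from the linearity of light transport: any pattern is a weighted sum of the one-light-at-a-time images. A minimal sketch, with array shapes assumed and noise simulation omitted:

```python
import numpy as np

def relight(olat_stack, pattern, ambient=None):
    """Synthesise a capture under an arbitrary display pattern from OLAT data.

    olat_stack : (L, H, W, 3) images, one per lit display patch/pixel
    pattern    : (L,) intensity of each display patch in the target pattern
    ambient    : optional (H, W, 3) dark-pattern image added back in

    Light transport is linear, so any pattern is a weighted sum of the
    one-light-at-a-time captures.
    """
    img = np.tensordot(pattern, olat_stack, axes=(0, 0))   # (H, W, 3)
    if ambient is not None:
        img = img + ambient
    return img
```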

Authors:Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Juming Xiong, Chongyu Qu, Mengmeng Yin, Yu Wang, Shilin Zhao, Haichun Yang, Daguang Xu, Yucheng Tang, Yuankai Huo
Title: Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning
Abstract:
Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and time-intensive nature of ST data acquisition. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8um or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at https://github.com/hrlblab/Img2ST-Net.
English: Recent multi-modal AI advances enable cost-effective generation of high-resolution spatial transcriptomics data from histology images, but face computational challenges that Img2ST-Net addresses through a parallel super-pixel framework and specialized evaluation metrics.
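
Recasting spot-by-spot regression as image-to-image generation amounts to a fully convolutional network whose output carries one channel per gene, predicted for all super-pixels in parallel. A minimal PyTorch sketch of this idea (the backbone, width, and gene count below are placeholders, not Img2ST-Net's actual architecture):

import torch
import torch.nn as nn

class HistologyToST(nn.Module):
    def __init__(self, n_genes=1000, width=64):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the paper's backbone
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        )
        # one output channel per gene: dense expression maps, all in parallel
        self.head = nn.Conv2d(width, n_genes, kernel_size=1)

    def forward(self, wsi_patch):                   # (B, 3, H, W) histology tile
        return self.head(self.encoder(wsi_patch))  # (B, n_genes, H/4, W/4)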

Authors:Runshi Zhang, Bimeng Jie, Yang He, Junchen Wang
Title: TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network
Abstract:
Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. Traditional biomechanical simulation methods are limited by their high computational cost, labor-intensive data processing, and low accuracy. Recently, deep learning-based simulation methods have been proposed that view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex registration-based preprocessing and postprocessing operations. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs. Compared with existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.
English Summary: The proposed TCFNet, a Transformer-based coarse-to-fine network, overcomes limitations of existing methods by learning patch and point-level correspondences for accurate face-bone transformations through complementary global and local feature integration.

Authors:Zhujun Li, Shuo Zhang, Ioannis Stamos
Title: Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation
Abstract:
Category-level object pose estimation aims to predict the 6D pose and 3D size of objects within given categories. Existing approaches for this task rely solely on 6D poses as supervisory signals without explicitly capturing the intrinsic continuity of poses, leading to inconsistencies in predictions and reduced generalization to unseen poses. To address this limitation, we propose HRC-Pose, a novel depth-only framework for category-level object pose estimation, which leverages contrastive learning to learn point cloud representations that preserve the continuity of 6D poses. HRC-Pose decouples object pose into rotation and translation components, which are separately encoded and leveraged throughout the network. Specifically, we introduce a contrastive learning strategy for multi-task, multi-category scenarios based on our 6D pose-aware hierarchical ranking scheme, which contrasts point clouds from multiple categories by considering rotational and translational differences as well as categorical information. We further design pose estimation modules that separately process the learned rotation-aware and translation-aware embeddings. Our experiments demonstrate that HRC-Pose successfully learns continuous feature spaces. Results on REAL275 and CAMERA25 benchmarks show that our method consistently outperforms existing depth-only state-of-the-art methods and runs in real-time, demonstrating its effectiveness and potential for real-world applications. Our code is at https://github.com/zhujunli1993/HRC-Pose.
English: HRC-Pose is a novel depth-only framework for category-level object pose estimation that uses contrastive learning to preserve 6D pose continuity, outperforming existing methods on benchmarks while running in real-time.
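
The heart of such pose-aware contrastive learning is making embedding distances respect the ordering of pose differences. A hedged sketch of one plausible ranking term (a simplification: HRC-Pose's hierarchical scheme additionally decouples rotation from translation and accounts for category information):

import torch
import torch.nn.functional as F

def pose_ranking_loss(z_a, z_p, z_n, margin=0.1):
    """Ranking over embedding distances: z_p comes from a sample whose
    6D pose is closer to the anchor's than z_n's, so its embedding
    distance to the anchor should be smaller by at least a margin."""
    d_p = 1 - F.cosine_similarity(z_a, z_p)  # distance to pose-near sample
    d_n = 1 - F.cosine_similarity(z_a, z_n)  # distance to pose-far sample
    return F.relu(d_p - d_n + margin).mean()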

Authors:Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Oscar Stanchi, Santiago Ponte Ahón, Waldo Hasperué
Title: HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation
Abstract:
Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.
English: The study introduces a lightweight sign generation model using CMLPe and synthetic data pretraining to overcome limited training data in Sign Language Recognition, achieving state-of-the-art results and demonstrating superior or complementary performance compared to traditional methods.

Authors:Anushka A. Kore, Frank G. te Nijenhuis, Matthijs van der Sluijs, Wim van Zwam, Charles Majoie, Geert Lycklama à Nijeholt, Danny Ruijters, Frans Vos, Sandra Cornelissen, Ruisheng Su, Theo van Walsum
Title: OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA
Abstract:
Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model's capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at https://github.com/anushka-kore/OccluNet.git
English Summary: This study introduces OccluNet, a spatio-temporal deep learning model combining YOLOX with transformer-based attention mechanisms to automate vascular occlusion detection in DSA sequences, demonstrating superior performance over baseline models with 89.02% precision and 74.87% recall.

Authors:Said Djafar Said, Torkan Gholamalizadeh, Mostafa Mehdipour Ghazi
Title: Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning
Abstract:
Despite the growing importance of dental CBCT scans for diagnosis and treatment planning, generating anatomically realistic scans with fine-grained control remains a challenge in medical image synthesis. In this work, we propose a novel conditional diffusion framework for 3D dental volume generation, guided by tooth-level binary attributes that allow precise control over tooth presence and configuration. Our approach integrates wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions to focus learning on relevant anatomical structures. We evaluate the model across diverse tasks, such as tooth addition, removal, and full dentition synthesis, using both paired and distributional similarity metrics. Results show strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans. By enabling realistic, localized modification of dentition without rescanning, this work opens opportunities for surgical planning, patient communication, and targeted data augmentation in dental AI workflows. The codes are available at: https://github.com/djafar1/tooth-diffusion.
English: This study introduces a conditional diffusion framework for generating realistic 3D dental CBCT scans with precise control over tooth attributes, achieving high fidelity and generalization in tasks like tooth modification and synthesis.
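
FiLM conditioning, named in the abstract, modulates feature maps with a per-channel scale and shift predicted from the conditioning vector (here, tooth-level binary attributes). A generic PyTorch sketch of a FiLM block (dimensions and wiring are assumptions, not the paper's exact design):

import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: a conditioning vector predicts a
    per-channel scale (gamma) and shift (beta) applied to the features."""
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, feat, cond):  # feat: (B, C, D, H, W) for 3D volumes
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        shape = (feat.shape[0], -1) + (1,) * (feat.dim() - 2)
        return feat * (1 + gamma.view(shape)) + beta.view(shape)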

Authors:Tinghan Yang, Md Ashiqur Rahman, Raymond A. Yeh
Title: CLIPSym: Delving into Symmetry Detection with CLIP
Abstract:
Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
English: CLIPSym, a novel symmetry detection method, leverages pre-trained CLIP models with a rotation-equivariant decoder and Semantic-Aware Prompt Grouping to outperform state-of-the-art approaches on standard datasets.
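
Aggregating frequent object-based prompts into a grouped text embedding can be sketched with the openai CLIP package (https://github.com/openai/CLIP); the prompt list and the mean pooling below are illustrative stand-ins, not the paper's exact SAPG procedure:

import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")
prompts = ["a photo of a butterfly", "a photo of a window", "a photo of a wheel"]
with torch.no_grad():
    emb = model.encode_text(clip.tokenize(prompts))   # (3, 512) text features
    emb = emb / emb.norm(dim=-1, keepdim=True)
    group_embedding = emb.mean(dim=0)  # pooled semantic cue for the decoder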

Authors:Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh
Title: Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
Abstract:
Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.
English: The paper introduces a deep equilibrium canonicalizer (DEC) to address local scale variations in computer vision by enhancing model equivariance, which boosts performance and scale consistency across multiple pre-trained networks on ImageNet.
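
A deep equilibrium module computes a fixed point z* = f(z*, x) and uses it as the (canonicalized) representation. A naive fixed-point-iteration sketch of that pattern (real DEQ canonicalizers use proper root solvers and implicit differentiation, and DEC's transform acts on local scale):

import torch
import torch.nn as nn

class SimpleDEQ(nn.Module):
    """Iterate z <- f(z, x) toward an approximate equilibrium; the fixed
    point serves as a canonical latent code for the input."""
    def __init__(self, dim=64, n_iters=30):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.n_iters = n_iters

    def forward(self, x):                        # x: (B, dim)
        z = torch.zeros_like(x)
        for _ in range(self.n_iters):            # naive solver for clarity
            z = self.f(torch.cat([z, x], dim=-1))
        return z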

Authors:Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
Title: RynnEC: Bringing MLLMs into Embodied World
Abstract:
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
English: RynnEC is a compact video multimodal large language model that achieves state-of-the-art performance in embodied cognition tasks through region-level video interaction and addresses data scarcity with an egocentric video data generation pipeline.

Authors:Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang
Title: LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Abstract:
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
English: LENS is a reinforcement learning framework that enhances text-prompted image segmentation by jointly optimizing chain-of-thought reasoning and segmentation, achieving state-of-the-art performance on benchmarks and improving generalization.

Authors:Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
Title: LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Abstract:
LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/

Authors:Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Title: Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Abstract:
Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR
English: This study introduces a novel dataset and model for composed video retrieval, achieving state-of-the-art performance by integrating visual and textual information to precisely align dense modifications with target videos.

Authors:Lintao Xiang, Xinkai Chen, Jianhuang Lai, Guangcong Wang
Title: Distilled-3DGS: Distilled 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher models. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: https://distilled3dgs.github.io. Code: https://github.com/lt-xiang/Distilled-3DGS.
English: This paper introduces Distilled-3DGS, a knowledge distillation framework that reduces the memory and storage demands of 3D Gaussian Splatting by using multiple teacher models and a structural similarity loss to train a compact student model, achieving high rendering quality and efficiency.
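
The recipe can be pictured as aggregating several teacher renders into one target and adding a structure-sensitive term. A schematic PyTorch sketch (the gradient-based structure term below is a stand-in; the paper's structural similarity loss acts on the spatial geometric distributions of the Gaussians):

import torch
import torch.nn.functional as F

def distill_loss(student_img, teacher_imgs, w_struct=0.1):
    """Photometric distillation toward the aggregated teacher renders,
    plus a simple finite-difference structure term."""
    target = torch.stack(teacher_imgs).mean(dim=0)  # aggregate the teachers
    photo = F.l1_loss(student_img, target)
    def grads(img):  # image gradients along height and width
        return img[..., 1:, :] - img[..., :-1, :], img[..., :, 1:] - img[..., :, :-1]
    (sy, sx), (ty, tx) = grads(student_img), grads(target)
    struct = F.l1_loss(sy, ty) + F.l1_loss(sx, tx)
    return photo + w_struct * struct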

Authors:Ken Deng, Yunhan Yang, Jingxiang Sun, Xihui Liu, Yebin Liu, Ding Liang, Yan-Pei Cao
Title: GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Abstract:
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts - clicks or boxes - to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object and aggregated across views. Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both slow optimization-based pipelines and fast but coarse feedforward approaches. Our results highlight a new paradigm: aligning 3D segmentation with SAM2 and leveraging interactive 2D inputs to unlock controllability and precision in object-level part understanding.

Authors:Tuo Chen, Jie Gui, Minjing Dong, Ju Jia, Lanting Fang, Jian Liu
Title: Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
Abstract:
Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at https://github.com/jsrdcht/Noisy-Alignment.
English Summary: Noisy Alignment is a novel data poisoning backdoor attack method that enhances attack efficacy by explicitly suppressing noise components in poisoned images through strategic manipulation of contrastive learning's cropping mechanism, achieving state-of-the-art performance while maintaining clean-data accuracy.

Authors:Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Title: RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Abstract:
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
English: This study reveals that current Multimodal Large Language Models struggle to reliably identify image rotations, particularly distinguishing between 90° and 270° orientations, exposing a significant gap in spatial reasoning compared to human perception.

Authors:Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang
Title: MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
Abstract:
Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.
English: Multimodal large language models have advanced significantly, yet existing scientific benchmarks lack comprehensive multilingual reasoning assessment, full modality coverage, and fine-grained knowledge annotation, prompting the introduction of MME-SCI—a challenging benchmark covering four subjects and five languages that reveals substantial performance gaps in current models.

Authors:Chunji Lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Wei Chen, Changsheng Li
Title: PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
Abstract:
While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization, which needs to back-propagate gradients through the complex differentiable simulation and rasterization. To facilitate the training, we introduce a new dataset, PhysAssets, of over 24,000 3D assets annotated with physical properties and corresponding guiding videos. Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results. Our project page is at: https://hihixiaolv.github.io/PhysGM.github.io/

Authors:Valentina Corbetta, Floris Six Dijkstra, Regina Beets-Tan, Hoel Kervadec, Kristoffer Wickstrøm, Wilson Silva
Title: In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging
Abstract:
Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization
English: LCRReg is a novel regularization method that uses latent concept representations to guide medical imaging models toward clinically meaningful features, improving robustness against spurious correlations and out-of-distribution generalization without requiring dense concept supervision.
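
A Concept Activation Vector is a direction in latent space that separates concept examples from non-concept ones, and a regularizer can then reward activation along that direction for concept-positive samples. A hedged sketch of both steps (difference-of-means CAVs and a cosine penalty are simplifications of the paper's formulation):

import torch
import torch.nn.functional as F

def concept_activation_vector(pos_feats, neg_feats):
    """Crude CAV: difference of class means in latent space (CAVs are
    more commonly the weights of a linear probe trained on the concept)."""
    return F.normalize(pos_feats.mean(0) - neg_feats.mean(0), dim=0)

def lcr_reg(latents, cav, labels):
    """Penalize concept-positive samples whose latents do not align
    with the concept direction."""
    cos = F.cosine_similarity(latents, cav.unsqueeze(0))  # (B,)
    return (1 - cos[labels == 1]).mean()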

Authors:Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, Akihiro Sugimoto
Title: SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation
Abstract:
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.
English: This paper introduces a novel training-free method that enhances text-to-image generation by learning a high-success-rate distribution for precise prompt alignment, improving control and reducing artifacts while supporting additional conditioning like bounding boxes.

Authors:Sebastian Ibarra, Javier del Riego, Alessandro Catanese, Julian Cuba, Julian Cardona, Nataly Leon, Jonathan Infante, Karim Lekadir, Oliver Diaz, Richard Osuala
Title: Comparing Conditional Diffusion Models for Synthesizing Contrast-Enhanced Breast MRI from Pre-Contrast Images
Abstract:
Dynamic contrast-enhanced (DCE) MRI is essential for breast cancer diagnosis and treatment. However, its reliance on contrast agents introduces safety concerns, contraindications, increased cost, and workflow complexity. To this end, we present pre-contrast conditioned denoising diffusion probabilistic models to synthesize DCE-MRI, introducing, evaluating, and comparing a total of 22 generative model variants in both single-breast and full breast settings. Towards enhancing lesion fidelity, we introduce both tumor-aware loss functions and explicit tumor segmentation mask conditioning. Using a public multicenter dataset and comparing to respective pre-contrast baselines, we observe that subtraction image-based models consistently outperform post-contrast-based models across five complementary evaluation metrics. Apart from assessing the entire image, we also separately evaluate the region of interest, where both tumor-aware losses and segmentation mask inputs improve evaluation metrics. The latter notably enhance qualitative results capturing contrast uptake, albeit assuming access to tumor localization inputs that are not guaranteed to be available in screening settings. A reader study involving 2 radiologists and 4 MRI technologists confirms the high realism of the synthetic images, indicating an emerging clinical potential of generative contrast-enhancement. We share our codebase at https://github.com/sebastibar/conditional-diffusion-breast-MRI.
English Summary: This study introduces a generative AI method using pre-contrast MRI data to synthesize contrast-enhanced breast MRI, demonstrating superior performance over baseline methods through multiple evaluation metrics and clinical reader validation, while noting dependency on tumor localization inputs.
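
A tumor-aware loss of the kind described can be as simple as mask-based upweighting of the reconstruction error inside the lesion. A minimal PyTorch sketch (the weighting scheme is illustrative; the paper's exact losses may differ):

import torch
import torch.nn.functional as F

def tumor_aware_loss(pred, target, tumor_mask, w_tumor=5.0):
    """Weighted reconstruction loss: voxels inside the tumor mask count
    w_tumor times more, steering training toward lesion fidelity."""
    weights = 1.0 + (w_tumor - 1.0) * tumor_mask  # 1 outside, w_tumor inside
    return (weights * F.l1_loss(pred, target, reduction="none")).mean()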

Authors:Tiago Assis, Ines P. Machado, Benjamin Zwick, Nuno C. Garcia, Reuben Dorent
Title: Deep Biomechanically-Guided Interpolation for Keypoint-Based Brain Shift Registration
Abstract:
Accurate compensation of brain shift is critical for maintaining the reliability of neuronavigation during neurosurgery. While keypoint-based registration methods offer robustness to large deformations and topological changes, they typically rely on simple geometric interpolators that ignore tissue biomechanics to create dense displacement fields. In this work, we propose a novel deep learning framework that estimates dense, physically plausible brain deformations from sparse matched keypoints. We first generate a large dataset of synthetic brain deformations using biomechanical simulations. Then, a residual 3D U-Net is trained to refine standard interpolation estimates into biomechanically guided deformations. Experiments on a large set of simulated displacement fields demonstrate that our method significantly outperforms classical interpolators, halving the mean square error while introducing negligible computational overhead at inference time. Code available at: https://github.com/tiago-assis/Deep-Biomechanical-Interpolator
English: This study introduces a deep learning framework that uses biomechanical simulations and a 3D U-Net to generate accurate, physically plausible brain deformations from sparse keypoints, significantly outperforming traditional interpolation methods with minimal computational cost.
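
The two-step pattern is a purely geometric interpolation of the sparse keypoint displacements, followed by a learned residual correction. A sketch of the first step with scipy, leaving the residual 3D U-Net as a comment (function and variable names are placeholders):

import numpy as np
from scipy.interpolate import RBFInterpolator

def interpolate_displacements(keypoints, displacements, grid_points):
    """Dense displacement field from sparse matched keypoints via a purely
    geometric RBF interpolator; the paper then refines this estimate with
    a residual 3D U-Net trained on biomechanical simulations."""
    rbf = RBFInterpolator(keypoints, displacements)  # (N, 3) -> (N, 3)
    return rbf(grid_points)                          # (M, 3) dense estimate

# dense_refined = dense_init + unet(dense_init)  # learned residual correction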

Authors:Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe
Title: Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Abstract:
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
English: FOCUS is a training-free decoding strategy that mitigates cross-image information leakage in Large Vision-Language Models by sequentially masking images with noise, aggregating logits, and contrastively refining outputs to enhance multi-image reasoning performance.
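
The decoding rule can be summarized as: average the logits obtained with all but one image masked by noise, then contrast against an all-noise reference. A schematic sketch (the contrast coefficient and aggregation details are assumptions); per_image_logits[i] would come from a forward pass in which every image except the i-th is replaced by random noise:

import torch

def focus_logits(per_image_logits, noise_only_logits, alpha=1.0):
    """Aggregate the per-image logits from noise-masked contexts, then
    push away from the noise-only reference to suppress leakage."""
    avg = torch.stack(per_image_logits).mean(dim=0)
    return (1 + alpha) * avg - alpha * noise_only_logits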

Authors:Ali Abdari, Alex Falcon, Giuseppe Serra
Title: Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture
Abstract:
Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users' interests remains a challenging task. A first step in this direction was taken recently, but existing datasets are small and insufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62% R@1 and 78% MRR, confirming its effectiveness, and it also improves on existing benchmarks by up to 6% R@1 and 11% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at https://github.com/aliabdari/Agricultural_Metaverse_Retrieval
English: This work introduces a new dataset of 457 agricultural virtual museums and a hierarchical vision-language model that effectively retrieves relevant Metaverse educational content using natural language queries, achieving significant performance improvements in retrieval metrics.

Authors:Dengxian Gong, Shunping Ji
Title: DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
Abstract:
The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10× faster. The code will be made publicly available at https://github.com/7777777FAN/DeH4R.
English: The proposed DeH4R model overcomes limitations in existing road extraction methods by combining graph-generating efficiency with graph-growing dynamics, achieving superior topology fidelity and faster inference speeds on benchmark datasets.

Authors:Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yuxiong Wu, Bin Wang
Title: OmniTry: Virtual Try-On Anything without Masks
Abstract:
Virtual Try-ON (VTON) is a practical and widely-applied task, for which most existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garments to encompass any wearable objects, e.g., jewelries and accessories, in a mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-staged pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry will be made publicly available at https://omnitry.github.io/.

Authors:Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
Title: TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Abstract:
Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in https://github.com/FreedomIntelligence/TalkVid
English: Current audio-driven talking head models suffer from limited generalization across diverse human demographics due to inadequate training data, prompting the introduction of TalkVid—a large-scale, high-quality dataset that improves model performance and reveals performance disparities through stratified evaluation.

Authors:Guiqin Wang, Peng Zhao, Cong Zhao, Jing Huang, Siyan Guo, Shusen Yang
Title: Generative Model-Based Feature Attention Module for Video Action Analysis
Abstract:
Video action analysis is a foundational technology for intelligent video comprehension, particularly in Internet of Things (IoT) applications. However, existing methodologies overlook feature semantics during feature extraction and focus on optimizing action proposals, making them unsuitable for widespread adoption in precision-critical, high-performance IoT applications such as autonomous driving, which necessitate robust and scalable intelligent video analysis. To address this issue, we propose a novel generative attention-based model to learn the relations among feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, effectively exploiting feature semantics during feature extraction. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. For action detection, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we extend the validation of our proposed method to the broader task of video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.
English: This paper introduces a generative attention-based model that enhances video action analysis by learning feature semantics through frame- and segment-dependencies, demonstrating superior performance in action detection and recognition tasks on benchmark datasets.

Authors:Yuchen Yang, Linfeng Dong, Wei Wang, Zhihang Zhong, Xiao Sun
Title: Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics
Abstract:
In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at https://github.com/Charrrrrlie/Learnable-SMPLify.
English Summary: Learnable SMPLify replaces SMPLify's iterative optimization with a neural regression model, achieving 200x faster speed while maintaining accuracy through temporal sampling and normalization techniques.
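
The temporal sampling strategy pairs one frame's pose (the initialization) with a nearby frame's joints (the target), and the network regresses a residual over the initialization in a normalized, human-centric frame. A minimal sketch of such a single-pass regressor (dimensions follow SMPL's 72 pose parameters and 24 joints; the architecture is a placeholder, not the paper's):

import torch
import torch.nn as nn

class ResidualIK(nn.Module):
    """Single-pass neural IK: predict a pose update from the previous
    frame's pose and the target 3D joints."""
    def __init__(self, n_params=72, n_joints=24):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_params + n_joints * 3, 512), nn.ReLU(),
            nn.Linear(512, n_params),
        )

    def forward(self, theta_init, joints_target):  # (B, 72), (B, 24, 3)
        x = torch.cat([theta_init, joints_target.flatten(1)], dim=-1)
        return theta_init + self.mlp(x)            # residual over the init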

Authors:Zhen Qu, Xian Tao, Xinyi Gong, ShiChen Qu, Xiaopei Zhang, Xingang Wang, Fei Shen, Zhengtao Zhang, Mukesh Prasad, Guiguang Ding
Title: DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
Abstract:
Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) Dictionary Construction - to simulate the index and content of a real dictionary using features from normal reference images. (2) Dictionary Lookup - to retrieve queried region features from the dictionary via a sparse lookup strategy. When a query feature cannot be retrieved, it is classified as an anomaly. (3) Query Discrimination Regularization - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To achieve this, Contrastive Query Constraint and Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
English: DictAS is a novel framework that enables few-shot anomaly segmentation for unseen object categories without retraining by using self-supervised dictionary lookup with normal reference images, outperforming existing methods across industrial and medical datasets.
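
The lookup idea: reconstruct each query region feature from its nearest entries in a dictionary built from the normal reference images, and flag queries that cannot be reconstructed. A simplified PyTorch sketch (a top-k softmax lookup standing in for DictAS's sparse lookup strategy):

import torch
import torch.nn.functional as F

def dictionary_anomaly_score(query, reference, k=5):
    """High score = the query feature is hard to retrieve from the
    normal-reference dictionary, hence likely anomalous."""
    q = F.normalize(query, dim=-1)      # (Nq, D) query region features
    r = F.normalize(reference, dim=-1)  # (Nr, D) dictionary content
    topk = (q @ r.t()).topk(k, dim=-1)  # nearest dictionary entries
    w = torch.softmax(topk.values, dim=-1)
    recon = torch.einsum("qk,qkd->qd", w, r[topk.indices])
    return 1 - F.cosine_similarity(q, recon)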

Authors:Sukhun Ko, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh
Title: FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
Abstract:
Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
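
Joint frequency selectivity and spatial locality are classically obtained with a Gabor-style unit, a tunable cosine under a Gaussian envelope, which sits exactly at the time-frequency trade-off the abstract mentions. An illustrative sketch in that spirit (not RC-GAUSS's actual parameterization, and WEGE's wavelet guidance is omitted):

import torch
import torch.nn as nn

class GaborActivation(nn.Module):
    """Gabor-like unit: a learnable cosine (frequency selectivity) under
    a learnable Gaussian envelope (spatial locality)."""
    def __init__(self, freq=10.0, spread=5.0):
        super().__init__()
        self.freq = nn.Parameter(torch.tensor(freq))
        self.spread = nn.Parameter(torch.tensor(spread))

    def forward(self, x):
        return torch.cos(self.freq * x) * torch.exp(-(self.spread * x) ** 2)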

Authors:Shihao Dong, Yuhui Zheng, Huiying Xu, Xinzhong Zhu
Title: Multi-view Clustering via Bi-level Decoupling and Consistency Learning
Abstract:
Multi-view clustering has been shown to be an effective method for analyzing underlying patterns in multi-view data. Clustering performance can be improved by learning the consistency and complementarity between multi-view features; however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore effective representations for multi-view data and to enhance the inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns consistent information while preserving the private features between views through a reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of the feature space and the cluster space. 3) The consistency learning module treats the different views of a sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with SOTA methods. Our code is published at https://github.com/LouisDong95/BDCL.
中文: 提出的双层解耦与一致性学习(BDCL)框架通过实例对齐、特征解耦和一致性学习增强多视图聚类中的类间区分度与类内紧密度,在基准数据集上展现了优越性能。
English: The proposed Bi-level Decoupling and Consistency Learning (BDCL) framework enhances multi-view clustering by improving inter-cluster discriminability and intra-cluster compactness through instance alignment, feature decoupling, and consistency learning, demonstrating superior performance on benchmark datasets.

Authors:Jingwen Yu, Jiayi Yang, Anjun Hu, Jiankun Wang, Ping Tan, Hong Zhang
Title: ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments
Abstract:
Loop closure detection is important for simultaneous localization and mapping (SLAM), which associates current observations with historical keyframes, achieving drift correction and global relocalization. However, a falsely detected loop can be fatal, and this is especially difficult in repetitive environments where appearance-based features fail due to the high similarity. Therefore, verification of a loop closure is a critical step in avoiding false positive detections. Existing works in loop closure verification predominantly focus on learning invariant appearance features, neglecting the prior knowledge of the robot's spatial-temporal motion cue, i.e., trajectory. In this letter, we propose ROVER, a loop closure verification method that leverages the historical trajectory as a prior constraint to reject false loops in challenging repetitive environments. For each loop candidate, it is first used to estimate the robot trajectory with pose-graph optimization. This trajectory is then submitted to a scoring scheme that assesses its compliance with the trajectory without the loop, which we refer to as the trajectory prior, to determine if the loop candidate should be accepted. Benchmark comparisons and real-world experiments demonstrate the effectiveness of the proposed method. Furthermore, we integrate ROVER into state-of-the-art SLAM systems to verify its robustness and efficiency. Our source code and self-collected dataset are available at https://github.com/jarvisyjw/ROVER.
中文: ROVER是一种利用历史轨迹作为先验约束的闭环验证方法,在重复环境中通过位姿图优化和评分机制评估轨迹一致性来拒绝错误闭环,从而提高SLAM系统的可靠性。
English: ROVER is a loop closure verification method that uses historical trajectory as a prior constraint to reject false loops in repetitive environments, enhancing SLAM reliability by assessing trajectory consistency through pose-graph optimization and a scoring scheme.
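
A minimal sketch of the trajectory-prior check described above: re-optimize the pose graph with the loop candidate, then score how strongly the candidate perturbs the prior trajectory and reject it when the deviation is large. The threshold and function names below are illustrative assumptions, not ROVER's actual scoring scheme.

```python
import numpy as np

def loop_consistency_score(traj_prior: np.ndarray, traj_with_loop: np.ndarray) -> float:
    """Mean positional deviation between the prior trajectory (no loop) and the
    trajectory re-optimized with the loop candidate; both are (T, 3) position arrays."""
    return float(np.linalg.norm(traj_with_loop - traj_prior, axis=1).mean())

def verify_loop(traj_prior: np.ndarray, traj_with_loop: np.ndarray, thresh: float = 0.5) -> bool:
    """Accept the loop candidate only if it barely perturbs the trajectory prior."""
    return loop_consistency_score(traj_prior, traj_with_loop) < thresh
```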

Authors:Pei Liu, Luping Ji, Jiaxiang Gou, Xiangxiang Zeng
Title: Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction
Abstract:
Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm where one cancer corresponds to one model. However, it naturally struggles to scale to rare tumors and cannot utilize the knowledge of other cancers. Although a multi-task learning-like framework has been studied recently, it usually has high demands on computational resources and needs considerable costs in iterative training on ultra-large multi-cancer WSI datasets. To this end, this paper makes a paradigm shift to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It has three major parts: (i) we curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors); (ii) beyond a simple evaluation merely for benchmark, we design a range of experiments to gain deeper insights into the underlying mechanism of transferability; (iii) we further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that could often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. We hope CROPKT could serve as an inception and lay the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
中文: 本文提出CROPKT,首次系统研究全切片图像的跨癌症预后知识迁移,通过构建26种癌症数据集和路由基线方法,突破传统单一癌症模型的局限,为罕见肿瘤预后建立新范式基础。
English: This paper introduces CROPKT, a pioneering study on cross-cancer prognosis knowledge transfer using Whole-Slide Images, which overcomes limitations of cancer-specific models by enabling efficient knowledge sharing across 26 cancers including rare tumors through curated datasets and routing-based methods.

Authors:Shuxin Liang, Yihan Xiao, Wenlu Tang
Title: InnerGS: Internal Scenes Rendering via Factorized 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object's interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modality. We provide a CUDA implementation at: https://github.com/Shuxin-Liang/InnerGS.
Chinese: 本研究提出了一种利用3D高斯泼溅技术重建内部场景的新方法,通过建模连续体积密度从稀疏数据中生成精细内部结构,无需相机位姿即可实现。
English: This work introduces a novel method for reconstructing internal scenes using 3D Gaussian Splatting, which models continuous volumetric density to create detailed interior structures from sparse data without requiring camera poses.

Authors:Zeynep Ozdemir, Hacer Yalim Keles, Omer Ozgur Tanriover
Title: CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification
Abstract:
Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at https://github.com/zeynepozdemir/CLoE.
Chinese: 提出的CLoE框架通过课程学习和图像质量评估,解决了溃疡性结肠炎严重程度分类中的标签噪声和有序结构问题,在医学数据集上实现了更高的准确性和鲁棒性。
English: The proposed CLoE framework uses curriculum learning and image quality assessment to address label noise and ordinal structure in ulcerative colitis severity classification, achieving improved accuracy and robustness on medical datasets.
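
As a rough illustration of the difficulty-aware curriculum described above, samples can be ordered by an image-quality proxy (cleaner labels first) and the exposed fraction of the dataset can grow over epochs. This is a hedged sketch under assumed names and a linear pacing function, not the CLoE training recipe.

```python
import numpy as np

def curriculum_order(quality_scores) -> np.ndarray:
    """Indices sorted from high quality (presumed clean labels) to low quality (noisy)."""
    return np.argsort(-np.asarray(quality_scores))

def exposed_count(n_samples: int, epoch: int, total_epochs: int, start_frac: float = 0.3) -> int:
    """Linearly grow the exposed portion of the curriculum from start_frac to 100%."""
    frac = min(1.0, start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1))
    return int(frac * n_samples)

# usage sketch: per epoch, train on order[:exposed_count(len(order), epoch, total_epochs)]
# where order = curriculum_order(bbps_quality_scores)
```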

Authors:Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang
Title: RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning
Abstract:
Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs. Code and resources are available at: https://github.com/HSH55/RISE.
中文摘要:RISE框架通过两阶段方法改进视觉语言模型,首先生成经过验证的推理链,再通过微调使模型在复杂图像标注任务中实现更优性能,且无需人工标注推理过程。
English Summary: The RISE framework enhances Vision-Language Models through a two-stage process that generates verified reasoning chains and fine-tunes models to achieve superior performance in complex image annotation tasks without requiring manual rationale annotations.

Authors:Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
Title: MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Abstract:
AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
Chinese: 本文提出了MM-BrowseComp这一包含224个手工设计问题的新型基准,用于评估AI代理的多模态网络浏览能力,结果显示即使顶尖模型也因缺乏多模态推理能力而表现不佳,准确率仅为29.02%。
English: This paper introduces MM-BrowseComp, a new benchmark with 224 hand-crafted questions to evaluate AI agents' multimodal web browsing capabilities, revealing that even top models perform poorly with only 29.02% accuracy due to insufficient multimodal reasoning.

Authors:Wenhao Hu, Zesheng Li, Haonan Zhou, Liu Liu, Xuexiang Wen, Zhizhong Su, Xi Li, Gaoang Wang
Title: IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion
Abstract:
Reconstructing complete and interactive 3D scenes remains a fundamental challenge in computer vision and robotics, particularly due to persistent object occlusions and limited sensor coverage. Multi-view observations from a single scene scan often fail to capture the full structural details. Existing approaches typically rely on multi-stage pipelines, such as segmentation, background completion, and inpainting, or require per-object dense scanning, both of which are error-prone and not easily scalable. We propose IGFuse, a novel framework that reconstructs an interactive Gaussian scene by fusing observations from multiple scans, where natural object rearrangement between captures reveals previously occluded regions. Our method constructs segmentation-aware Gaussian fields and enforces bi-directional photometric and semantic consistency across scans. To handle spatial misalignments, we introduce a pseudo-intermediate scene state for unified alignment, alongside collaborative co-pruning strategies to refine geometry. IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex pipelines. Extensive experiments validate the framework's strong generalization to novel scene configurations, demonstrating its effectiveness for real-world 3D reconstruction and real-to-simulation transfer. Our project page is available online.

Authors:Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Title: Precise Action-to-Video Generation Through Visual Action Prompts
Abstract:
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
中文摘要:本文提出视觉动作提示作为统一表征,通过从人-物交互和机器人操作数据中提取视觉骨架,在动作到视频生成中实现了几何精度与跨域适应性的平衡,既能精确控制复杂交互又能保持动态迁移能力。
English Summary: The paper introduces visual action prompts as a unified representation that balances geometric precision and cross-domain adaptability for action-to-video generation, using visual skeletons extracted from human-object interactions and robotic manipulation data to enable precise control while maintaining transferable dynamics.

Authors:Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, Liqiang Nie
Title: Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Abstract:
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation
中文: 本综述系统梳理了基于大视觉语言模型的视觉-语言-动作机器人操控研究,通过分类架构特点与指明未来方向推动领域发展。
English: This survey systematically reviews Vision-Language-Action models for robotic manipulation, categorizing architectures and identifying future research directions to advance the field.

Authors:Tejas Chaudhari, Akarsh J., Tanushree Dewangan, Mukul Lokhande, Santosh Kumar Vishvakarma
Title: XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads
Abstract:
This work proposes XR-NPE, a high-throughput Mixed-precision SIMD Neural Processing Engine designed for extended reality (XR) perception workloads like visual inertial odometry (VIO), object classification, and eye gaze extraction. XR-NPE is the first to support FP4, Posit (4,1), Posit (8,0), and Posit (16,1) formats, with a layer-adaptive hybrid-algorithmic implementation supporting ultra-low bit precision to significantly reduce memory bandwidth requirements, accompanied by quantization-aware training for minimal accuracy loss. The proposed Reconfigurable Mantissa Multiplication and Exponent processing Circuitry (RMMEC) reduces dark silicon in the SIMD MAC compute engine, assisted by selective power gating to reduce energy consumption, providing 2.85x improved arithmetic intensity. XR-NPE achieves a maximum operating frequency of 1.72 GHz, an area of 0.016 mm², and an arithmetic intensity of 14 pJ at 28nm CMOS, reducing area by 42% and power by 38% compared to the best state-of-the-art MAC approaches. The proposed XR-NPE-based AXI-enabled matrix-multiplication co-processor consumes 1.4x fewer LUTs, 1.77x fewer FFs, and provides 1.2x better energy efficiency compared to SoTA accelerators on VCU129. The proposed co-processor provides 23% better energy efficiency and 4% better compute density for VIO workloads. XR-NPE establishes itself as a scalable, precision-adaptive compute engine for future resource-constrained XR devices. The complete set of code for results reproducibility is released publicly, enabling designers and researchers to readily adopt and build upon it. https://github.com/mukullokhande99/XR-NPE.
中文:XR-NPE是一种面向扩展现实应用的高吞吐量混合精度神经网络处理引擎,采用创新的低精度格式和可重构电路设计,在保持精度的同时显著降低能耗和硬件需求。
English: XR-NPE is a high-throughput mixed-precision neural processing engine designed for extended reality applications, featuring innovative low-precision formats and reconfigurable circuitry to significantly reduce energy consumption and hardware requirements while maintaining accuracy.

Authors:Ayaka Yasunaga, Hideo Saito, Dieter Schmalstieg, Shohei Mori
Title: IntelliCap: Intelligent Guidance for Consistent View Sampling
Abstract:
Novel view synthesis from images, for example, with 3D Gaussian splatting, has made great progress. Rendering fidelity and speed are now ready even for demanding virtual reality applications. However, the problem of assisting humans in collecting the input images for these rendering algorithms has received much less attention. High-quality view synthesis requires uniform and dense view sampling. Unfortunately, these requirements are not easily addressed by human camera operators, who are in a hurry, impatient, or lack understanding of the scene structure and the photographic process. Existing approaches to guide humans during image acquisition concentrate on single objects or neglect view-dependent material characteristics. We propose a novel situated visualization technique for scanning at multiple scales. During the scanning of a scene, our method identifies important objects that need extended image coverage to properly represent view-dependent appearance. To this end, we leverage semantic segmentation and category identification, ranked by a vision-language model. Spherical proxies are generated around highly ranked objects to guide the user during scanning. Our results show superior performance in real scenes compared to conventional view sampling strategies.

Authors:Ruru Xu, Ilkay Oksuz
Title: HierAdaptMR: Cross-Center Cardiac MRI Reconstruction with Hierarchical Feature Adapters
Abstract:
Deep learning-based cardiac MRI reconstruction faces significant domain shift challenges when deployed across multiple clinical centers with heterogeneous scanner configurations and imaging protocols. We propose HierAdaptMR, a hierarchical feature adaptation framework that addresses multi-level domain variations through parameter-efficient adapters. Our method employs Protocol-Level Adapters for sequence-specific characteristics and Center-Level Adapters for scanner-dependent variations, built upon a variational unrolling backbone. A Universal Adapter enables generalization to entirely unseen centers through stochastic training that learns center-invariant adaptations. The framework utilizes multi-scale SSIM loss with frequency domain enhancement and contrast-adaptive weighting for robust optimization. Comprehensive evaluation on the CMRxRecon2025 dataset spanning 5+ centers, 10+ scanners, and 9 modalities demonstrates superior cross-center generalization while maintaining reconstruction quality. Code: https://github.com/Ruru-Xu/HierAdaptMR
Chinese: HierAdaptMR是一种分层特征自适应框架,通过协议级和中心级适配器以及通用适配器处理多中心心脏MRI重建中的领域偏移问题,在CMRxRecon2025数据集上实现了卓越的跨中心泛化能力。
English: HierAdaptMR is a hierarchical feature adaptation framework that tackles domain shifts in cardiac MRI reconstruction across multiple clinical centers by employing protocol-level and center-level adapters, along with a universal adapter for unseen centers, achieving superior cross-center generalization on the CMRxRecon2025 dataset.

Authors:Rohan Asthana, Joschua Conrad, Maurits Ortmanns, Vasileios Belagiannis
Title: Dextr: Zero-Shot Neural Architecture Search with Singular Value Decomposition and Extrinsic Curvature
Abstract:
Zero-shot Neural Architecture Search (NAS) typically optimises the architecture search process by exploiting the network or gradient properties at initialisation through zero-cost proxies. The existing proxies often rely on labelled data, which is usually unavailable in real-world settings. Furthermore, the majority of the current methods focus either on optimising the convergence and generalisation attributes or solely on the expressivity of the network architectures. To address both limitations, we first demonstrate how channel collinearity affects the convergence and generalisation properties of a neural network. Then, by incorporating the convergence, generalisation and expressivity in one approach, we propose a zero-cost proxy that omits the requirement of labelled data for its computation. In particular, we leverage the Singular Value Decomposition (SVD) of the neural network layer features and the extrinsic curvature of the network output to design our proxy. As a result, the proposed proxy is formulated as the simplified harmonic mean of the logarithms of two key components: the sum of the inverse of the feature condition number and the extrinsic curvature of the network output. Our approach enables accurate prediction of network performance on test data using only a single label-free data sample. Our extensive evaluation includes a total of six experiments, including the Convolutional Neural Network (CNN) search space, i.e. DARTS, and the Transformer search space, i.e. AutoFormer. The proposed proxy demonstrates superior performance on multiple correlation benchmarks, including NAS-Bench-101, NAS-Bench-201, and TransNAS-Bench-101-micro, as well as on the NAS task within the DARTS and AutoFormer search spaces, all while being notably efficient. The code is available at https://github.com/rohanasthana/Dextr.
Chinese: 本研究提出了一种新颖的零样本神经架构搜索代理方法,通过奇异值分解和外在曲率分析,将网络收敛性、泛化性和表达能力相结合,无需依赖标注数据即可实现高效架构评估。
English: This study introduces a novel zero-shot neural architecture search proxy that eliminates the need for labeled data by integrating network convergence, generalization, and expressivity through singular value decomposition and extrinsic curvature analysis.
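
Reading the abstract literally, the label-free proxy combines the sum of inverse feature condition numbers (from an SVD of layer activations) with the extrinsic curvature of the network output. The sketch below computes the condition-number term and combines it with a curvature value supplied by the caller; it is a loose paraphrase of the stated formula, the curvature estimation itself is omitted, and the names and numerical details are assumptions.

```python
import numpy as np

def inverse_condition_sum(layer_features) -> float:
    """layer_features: list of (n, c) activation matrices for one label-free sample.
    Sums the inverse condition number (sigma_min / sigma_max) over layers."""
    total = 0.0
    for F_l in layer_features:
        s = np.linalg.svd(F_l, compute_uv=False)
        total += float(s.min() / (s.max() + 1e-12))
    return total

def dextr_like_proxy(layer_features, extrinsic_curvature: float) -> float:
    """Simplified harmonic mean of the logs of the two components named in the abstract."""
    a = np.log(inverse_condition_sum(layer_features) + 1e-12)
    b = np.log(extrinsic_curvature + 1e-12)
    return 2.0 / (1.0 / a + 1.0 / b)
```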

Authors:Qirui Li, Guangcong Zheng, Qi Zhao, Jie Li, Bin Dong, Yiwu Yao, Xi Li
Title: Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
Abstract:
The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse patterns, fail to fully exploit the inherent spatio-temporal redundancies in video data. Through systematic analysis of video diffusion transformers (DiT), we uncover a key insight: Attention matrices exhibit structured, yet heterogeneous sparsity patterns, where specialized heads dynamically attend to distinct spatiotemporal regions (e.g., local pattern, cross-shaped pattern, or global pattern). Existing sparse attention methods either impose rigid constraints or introduce significant overhead, limiting their effectiveness. To address this, we propose Compact Attention, a hardware-aware acceleration framework featuring three innovations: 1) Adaptive tiling strategies that approximate diverse spatial interaction patterns via dynamic tile grouping, 2) Temporally varying windows that adjust sparsity levels based on frame proximity, and 3) An automated configuration search algorithm that optimizes sparse patterns while preserving critical attention pathways. Our method achieves 1.6-2.5x acceleration in attention computation on single-GPU setups while maintaining comparable visual quality with full-attention baselines. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation. Project Page: https://yo-ava.github.io/Compact-Attention.github.io/
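
One ingredient described above, temporally varying windows, can be sketched as a block mask over video tokens that only permits attention between tokens whose frames are close in time. This illustrative helper uses assumed tensor shapes and is not the Compact Attention kernel, which also performs adaptive tiling and automated pattern search.

```python
import torch

def temporal_window_mask(n_frames: int, tokens_per_frame: int, window: int) -> torch.Tensor:
    """Boolean (N, N) mask over N = n_frames * tokens_per_frame video tokens:
    True where attention is allowed, i.e. the two tokens' frames are within `window`."""
    n = n_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    dist = (frame_id[:, None] - frame_id[None, :]).abs()
    return dist <= window

# usage sketch: scores.masked_fill_(~temporal_window_mask(F, P, w), float("-inf"))
```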

Authors:Hongyang Chen, Shaoling Pu, Lingyu Zheng, Zhongwu Sun
Title: SEDEG: Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory
Abstract:
In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs. It can develop generalized representations or more balanced decision boundaries, preventing the degradation of long-term knowledge over time and thus mitigating catastrophic forgetting. Some emerging incremental learning methods adopt an encoder-decoder architecture and have achieved promising results. In the encoder-decoder architecture, improving the generalization capabilities of both the encoder and decoder is critical, as it helps preserve previously learned knowledge while ensuring adaptability and robustness to new, diverse data inputs. However, many existing continual methods focus solely on enhancing one of the two components, which limits their effectiveness in mitigating catastrophic forgetting. These methods perform even worse in small-memory scenarios, where only a limited number of historical samples can be stored. To mitigate this limitation, we introduce SEDEG, a two-stage training framework for vision transformers (ViT), focusing on sequentially improving the generality of both the decoder and the encoder. Initially, SEDEG trains an ensembled encoder through feature boosting to learn generalized representations, which subsequently enhance the decoder's generality and balance the classifier. The next stage uses knowledge distillation (KD) strategies to compress the ensembled encoder and develop a new, more generalized encoder, employing a balanced KD approach and feature KD for effective knowledge transfer. Extensive experiments on three benchmark datasets show SEDEG's superior performance, and ablation studies confirm the efficacy of its components. The code is available at https://github.com/ShaolingPu/CIL.
中文: SEDEG是一种针对视觉变换器的两阶段训练框架,通过依次提升解码器和编码器的泛化能力来缓解增量学习中的灾难性遗忘问题,尤其在小内存场景下表现优异。
English: SEDEG is a two-stage training framework for vision transformers that sequentially enhances the generality of both the decoder and encoder to mitigate catastrophic forgetting in incremental learning, particularly in small-memory scenarios.

Authors:Ximiao Zhang, Min Xu, Xiuzhuang Zhou
Title: Towards High-Resolution Industrial Image Anomaly Detection
Abstract:
Current anomaly detection methods primarily focus on low-resolution scenarios. For high-resolution images, conventional downsampling often results in missed detections of subtle anomalous regions due to the loss of fine-grained discriminative information. Despite some progress, recent studies have attempted to improve detection resolution by employing lightweight networks or using simple image tiling and ensemble methods. However, these approaches still struggle to meet the practical demands of industrial scenarios in terms of detection accuracy and efficiency. To address the above issues, we propose HiAD, a general framework for high-resolution anomaly detection. HiAD is capable of detecting anomalous regions of varying sizes in high-resolution images under limited computational resources. Specifically, HiAD employs a dual-branch architecture that integrates anomaly cues across different scales to comprehensively capture both subtle and large-scale anomalies. Furthermore, it incorporates a multi-resolution feature fusion strategy to tackle the challenges posed by fine-grained texture variations in high-resolution images. To enhance both adaptability and efficiency, HiAD utilizes a detector pool in conjunction with various detector assignment strategies, enabling detectors to be adaptively assigned based on patch features, ensuring detection performance while effectively controlling computational costs. We conduct extensive experiments on our specifically constructed high-resolution anomaly detection benchmarks, including MVTec-HD, VisA-HD, and the real-world benchmark RealIAD-HD, demonstrating the superior performance of HiAD. The code is available at https://github.com/cnulab/HiAD.
中文摘要:现有异常检测方法因下采样导致高分辨率图像细节丢失,为此我们提出HiAD框架,采用双分支结构和多分辨率特征融合,通过自适应分配检测器实现在控制计算成本的同时有效识别不同尺寸的异常区域。
English Summary: Current anomaly detection methods are inadequate for high-resolution images due to information loss from downsampling, so we propose HiAD, a dual-branch framework with multi-resolution fusion that adaptively assigns detectors to effectively identify anomalies of various sizes while controlling computational costs.
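
The detector-pool assignment mentioned above can be pictured as routing each high-resolution patch to the detector whose prototype best matches the patch feature. The prototype-matching rule and names below are assumptions for illustration; HiAD's actual assignment strategies may differ.

```python
import torch
import torch.nn.functional as F

def assign_detectors(patch_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """patch_feats: (P, D) features of image tiles; prototypes: (K, D), one per detector
    in the pool. Returns the index of the detector assigned to each patch."""
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T  # (P, K)
    return sim.argmax(dim=-1)
```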

Authors:Elena Izzo, Luca Parolari, Davide Vezzaro, Lamberto Ballan
Title: 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
Abstract:
Layout-guided text-to-image models offer greater control over the generation process by explicitly conditioning image synthesis on the spatial arrangement of elements. As a result, their adoption has increased in many computer vision applications, ranging from content creation to synthetic data generation. A critical challenge is achieving precise alignment between the image, textual prompt, and layout, ensuring semantic fidelity and spatial accuracy. Although recent benchmarks assess text alignment, layout alignment remains overlooked, and no existing benchmark jointly evaluates both. This gap limits the ability to evaluate a model's spatial fidelity, which is crucial when using layout-guided generation for synthetic data, as errors can introduce noise and degrade data quality. In this work, we introduce 7Bench, the first benchmark to assess both semantic and spatial alignment in layout-guided text-to-image generation. It features text-and-layout pairs spanning seven challenging scenarios, investigating object generation, color fidelity, attribute recognition, inter-object relationships, and spatial control. We propose an evaluation protocol that builds on existing frameworks by incorporating the layout alignment score to assess spatial accuracy. Using 7Bench, we evaluate several state-of-the-art diffusion models, uncovering their respective strengths and limitations across diverse alignment tasks. The benchmark is available at https://github.com/Elizzo/7Bench.
中文摘要:7Bench作为首个布局引导文本生成图像的综合基准,通过七种挑战性场景评估语义与空间双重对齐,填补了现有评估体系的关键空白。
English Summary: 7Bench is the first comprehensive benchmark for evaluating both semantic and spatial alignment in layout-guided text-to-image generation, addressing critical gaps in existing evaluation frameworks across seven challenging scenarios.
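
A layout alignment score of the kind described above can be approximated by matching each requested box to the detected box of the same object and averaging their IoU. This is only a plausible instantiation for illustration; the benchmark's actual protocol may weight or threshold these terms differently.

```python
def iou(box_a, box_b) -> float:
    """Boxes as (x1, y1, x2, y2) in the same coordinate frame."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def layout_alignment_score(target_boxes, detected_boxes) -> float:
    """Mean IoU between each box requested in the layout and the detected box
    of the corresponding object in the generated image."""
    pairs = list(zip(target_boxes, detected_boxes))
    return sum(iou(t, d) for t, d in pairs) / len(pairs) if pairs else 0.0
```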

Authors:Ronghao Lin, Shuai Shen, Weipeng Hu, Qiaolin He, Aolin Xiong, Li Huang, Haifeng Hu, Yap-peng Tan
Title: E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
Abstract:
Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs which decomposes MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system on both zero-shot and few-shot settings, securing Top-1 position in the Avatar-based Multimodal Empathy Challenge on ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.
中文摘要:E3RG是一个基于多模态大语言模型的显式情感驱动系统,通过将共情响应生成分解为理解、记忆和生成三阶段,无需额外训练即可产生自然且情感一致的多模态回应,并在权威评测中取得最佳成绩。
English Summary: E3RG is an explicit emotion-driven system that enhances multimodal empathetic response generation by decomposing it into empathy understanding, memory retrieval, and response generation, achieving top performance without additional training.

Authors:Ronghao Lin, Sijie Mai, Ying Zeng, Qiaolin He, Aolin Xiong, Haifeng Hu
Title: Multi-source Multimodal Progressive Domain Adaption for Audio-Visual Deception Detection
Abstract:
This paper presents the winning approach for the 1st MultiModal Deception Detection (MMDD) Challenge at the 1st Workshop on Subtle Visual Computing (SVC). Aiming at the domain shift issue across source and target domains, we propose a Multi-source Multimodal Progressive Domain Adaptation (MMPDA) framework that transfers audio-visual knowledge from diverse source domains to the target domain. By gradually aligning the source and target domains at both the feature and decision levels, our method bridges domain shifts across diverse multimodal datasets. Extensive experiments demonstrate the effectiveness of our approach, which secures the Top-2 place. Our approach reaches 60.43% accuracy and 56.99% F1-score on competition stage 2, surpassing the 1st-place team by 5.59% on F1-score and the 3rd-place team by 6.75% on accuracy. Our code is available at https://github.com/RH-Lin/MMPDA.
Chinese: 本文提出的MMPDA框架通过渐进式多模态领域自适应方法,在特征和决策层面跨域对齐音视频数据,以60.43%准确率和56.99% F1值在MMDD挑战赛中取得优异成绩。
English: This paper introduces the MMPDA framework, which addresses domain shift by progressively aligning multimodal data across source and target domains, achieving top performance in the MMDD Challenge with 60.43% accuracy and 56.99% F1-score.

Authors:Friedhelm Hamann, Emil Mededovic, Fabian Gülhan, Yuli Wu, Johannes Stegmaier, Jing He, Yiqing Wang, Kexin Zhang, Lingling Li, Licheng Jiao, Mengru Ma, Hongxiang Huang, Yuhao Yan, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Bojun Cheng, Se Hyun Lee, Gyu Sung Ham, Kanghan Oh, Gi Hyun Lim, Boxuan Yang, Bowen Du, Guillermo Gallego
Title: SIS-Challenge: Event-based Spatio-temporal Instance Segmentation Challenge at the CVPR 2025 Event-based Vision Workshop
Abstract:
We present an overview of the Spatio-temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event-based Vision Workshop. The task is to predict accurate pixel-level segmentation masks of defined object classes from spatio-temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top-5 ranking teams in the challenge. More resources and code of the participants' methods are available here: https://github.com/tub-rip/MouseSIS/blob/main/docs/challenge_results.md
中文: 本文概述了CVPR 2025会议中时空实例分割挑战赛,介绍了从事件相机与灰度相机数据预测物体分割掩码的任务、数据集、比赛结果及优胜团队的解决方案。
English: This abstract summarizes the Spatio-temporal Instance Segmentation challenge at CVPR 2025, detailing the task of predicting object masks from event and grayscale camera data, along with challenge results and top methods.

Authors:Damian Machlanski, Stephanie Riley, Edward Moroshko, Kurt Butler, Panagiotis Dimitrakopoulos, Thomas Melistas, Akchunya Chanchal, Steven McDonagh, Ricardo Silva, Sotirios A. Tsaftaris
Title: A Shift in Perspective on Causality in Domain Generalization
Abstract:
The promise that causal modelling can lead to robust AI generalization has been challenged in recent work on domain generalization (DG) benchmarks. We revisit the claims of the causality and DG literature, reconciling apparent contradictions and advocating for a more nuanced theory of the role of causality in generalization. We also provide an interactive demo at https://chai-uk.github.io/ukairs25-causal-predictors/.

Authors:Peihao Li, Yan Fang, Man Liu, Huihui Bai, Anhong Wang, Yunchao Wei, Yao Zhao
Title: Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors
Abstract:
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, necessitating annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique "many-to-one" relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a "one-to-one" relationship, where each image is independently associated with its GT. Such a limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated samples (5‰). The code is available at https://github.com/pipixiapipi/ICAF.
中文摘要:针对碲锌镉半导体图像低对比度缺陷边界标注难题,本文提出基于组内一致性增强框架(ICAF),通过视图增强与校正模块强化多视图间一致性表征,仅用千分之五标注数据即在CdZnTe数据集上实现70.6%的mIoU。
English Summary: The proposed Intra-group Consistency Augmentation Framework (ICAF) addresses the limitations of semi-supervised semantic segmentation in low-contrast CdZnTe semiconductor images by leveraging group-oriented consistency constraints and pseudo-label correction, achieving 70.6% mIoU with minimal annotated data.

Authors:Cristo J. van den Berg, Frank G. te Nijenhuis, Mirre J. Blaauboer, Daan T. W. van Erp, Carlijn M. Keppels, Matthijs van der Sluijs, Bob Roozenbeek, Wim van Zwam, Sandra Cornelissen, Danny Ruijters, Ruisheng Su, Theo van Walsum
Title: CLAIRE-DSA: Fluoroscopic Image Classification for Quality Assurance of Computer Vision Pipelines in Acute Ischemic Stroke
Abstract:
Computer vision models can be used to assist during mechanical thrombectomy (MT) for acute ischemic stroke (AIS), but poor image quality often degrades performance. This work presents CLAIRE-DSA, a deep learning-based framework designed to categorize key image properties in minimum intensity projections (MinIPs) acquired during MT for AIS, supporting downstream quality control and workflow optimization. CLAIRE-DSA uses pre-trained ResNet backbone models, fine-tuned to predict nine image properties (e.g., presence of contrast, projection angle, motion artefact severity). Separate classifiers were trained on an annotated dataset containing 1,758 fluoroscopic MinIPs. The model achieved excellent performance on all labels, with ROC-AUC ranging from 0.91 to 0.98, and precision ranging from 0.70 to 1.00. The ability of CLAIRE-DSA to identify suitable images was evaluated on a segmentation task by filtering poor-quality images and comparing segmentation performance on filtered and unfiltered datasets. Segmentation success rate increased from 42% to 69% (p < 0.001). CLAIRE-DSA demonstrates strong potential as an automated tool for accurately classifying image properties in DSA series of acute ischemic stroke patients, supporting image annotation and quality control in clinical and research applications. Source code is available at https://gitlab.com/icai-stroke-lab/wp3_neurointerventional_ai/claire-dsa.

Authors:Kangjie Chen, Yingji Zhong, Zhihao Li, Jiaqi Lin, Youyu Chen, Minghan Qin, Haoqian Wang
Title: Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly-entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint, with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random gaussian dropout; (2) multiplicative noise injection to the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
中文: 3D高斯泼溅在稀疏视角下因高斯过度协同适应而产生外观伪影,但可通过随机丢弃和透明度噪声注入等即插即用策略有效缓解。
English: 3D Gaussian Splatting struggles with appearance artifacts in sparse-view scenarios due to excessive co-adaptation among Gaussians, but this can be mitigated through plug-and-play strategies like random dropout and opacity noise injection.
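
The Co-Adaptation Score and the two mitigation strategies are simple enough to sketch: render the same viewpoint several times with random subsets of Gaussians and average the pixel-wise variance, and during training drop Gaussians or multiply opacities by noise. The rendering callback and parameter values below are assumptions; this is an illustrative paraphrase, not the paper's code.

```python
import torch

def co_adaptation_score(render_fn, n_gaussians: int, n_subsets: int = 8, keep_frac: float = 0.7) -> float:
    """render_fn(keep_mask) -> (H, W, 3) image of a fixed viewpoint rendered using only
    the Gaussians where keep_mask is True. Returns the mean pixel-wise variance across subsets."""
    renders = []
    for _ in range(n_subsets):
        keep_mask = torch.rand(n_gaussians) < keep_frac   # random Gaussian dropout
        renders.append(render_fn(keep_mask))
    return torch.stack(renders).var(dim=0).mean().item()

def noisy_opacity(opacity: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """Multiplicative noise injection on opacities, the second mitigation strategy."""
    return opacity * (1.0 + noise_std * torch.randn_like(opacity))
```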

Authors:Felix Embacher, David Holtz, Jonas Uhrig, Marius Cordts, Markus Enzweiler
Title: Neural Rendering for Sensor Adaptation in 3D Object Detection
Abstract:
Autonomous vehicles often have varying camera sensor setups, which is inevitable due to restricted placement options for different vehicle types. Training a perception model on one particular setup and evaluating it on a new, different sensor setup reveals the so-called cross-sensor domain gap, typically leading to a degradation in accuracy. In this paper, we investigate the impact of the cross-sensor domain gap on state-of-the-art 3D object detectors. To this end, we introduce CamShift, a dataset inspired by nuScenes and created in CARLA to specifically simulate the domain gap between subcompact vehicles and sport utility vehicles (SUVs). Using CamShift, we demonstrate significant cross-sensor performance degradation, identify robustness dependencies on model architecture, and propose a data-driven solution to mitigate the effect. On the one hand, we show that model architectures based on a dense Bird's Eye View (BEV) representation with backward projection, such as BEVFormer, are the most robust against varying sensor configurations. On the other hand, we propose a novel data-driven sensor adaptation pipeline based on neural rendering, which can transform entire datasets to match different camera sensor setups. Applying this approach improves performance across all investigated 3D object detectors, mitigating the cross-sensor domain gap by a large margin and reducing the need for new data collection by enabling efficient data reusability across vehicles with different sensor setups. The CamShift dataset and the sensor adaptation benchmark are available at https://dmholtz.github.io/camshift/.

Authors:Yuheng Zha, Kun Zhou, Yujia Wu, Yushu Wang, Jie Feng, Zhi Xu, Shibo Hao, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
Title: Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation
Abstract:
Despite their success, current training pipelines for reasoning VLMs focus on a limited range of tasks, such as mathematical and logical reasoning. As a result, these models face difficulties in generalizing their reasoning capabilities to a wide range of domains, primarily due to the scarcity of readily available and verifiable reward data beyond these narrowly defined areas. Moreover, integrating data from multiple domains is challenging, as the compatibility between domain-specific datasets remains uncertain. To address these limitations, we build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions, covering a wide range of tasks such as infographic, mathematical, spatial, cross-image, graphic user interface, medical, common sense and general science. We propose an influence function based data selection and difficulty based filtering strategy to identify high-quality training samples from this dataset. Subsequently, we train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities. Our model achieves state-of-the-art performance across various visual reasoning benchmarks, outperforming similar-sized VLMs and even proprietary models like GPT-4o and Gemini-1.5 Flash. The model, code and dataset are publicly available at https://github.com/yuh-zha/Vision-G1.
Chinese: 为解决现有视觉推理模型的局限性,我们通过整合46个来源的八个领域数据构建了全面数据集,并采用多轮强化学习课程训练出Vision-G1模型,在多项基准测试中实现了最先进的性能表现。
English: To overcome the limitations of current visual reasoning models, we developed Vision-G1 by creating a comprehensive dataset from 46 sources across eight domains and training it with a multi-round reinforcement learning curriculum, achieving state-of-the-art performance on various benchmarks.

Authors:Abhijay Ghildyal, Li-Yun Wang, Feng Liu
Title: WP-CLIP: Leveraging CLIP to Predict Wölfflin's Principles in Visual Art
Abstract:
Wölfflin's five principles offer a structured approach to analyzing stylistic variations for formal analysis. However, no existing metric effectively predicts all five principles in visual art. Computationally evaluating the visual aspects of a painting requires a metric that can interpret key elements such as color, composition, and thematic choices. Recent advancements in vision-language models (VLMs) have demonstrated their ability to evaluate abstract image attributes, making them promising candidates for this task. In this work, we investigate whether CLIP, pre-trained on large-scale data, can understand and predict Wölfflin's principles. Our findings indicate that it does not inherently capture such nuanced stylistic elements. To address this, we fine-tune CLIP on annotated datasets of real art images to predict a score for each principle. We evaluate our model, WP-CLIP, on GAN-generated paintings and the Pandora-18K art dataset, demonstrating its ability to generalize across diverse artistic styles. Our results highlight the potential of VLMs for automated art analysis.
中文: 本研究通过微调CLIP模型来预测沃尔夫林的艺术风格五原则,证明其能泛化应用于不同艺术风格,并凸显了视觉语言模型在自动化艺术分析中的潜力。
English: This study fine-tunes the CLIP model to predict Wölfflin's five stylistic principles in art, demonstrating its generalization across diverse styles and highlighting the potential of vision-language models for automated art analysis.

Authors:Chen Qian, Danyang Li, Xinran Yu, Zheng Yang, Qiang Ma
Title: OpenMoCap: Rethinking Optical Motion Capture under Real-world Occlusion
Abstract:
Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: https://github.com/qianchen214/OpenMoCap.
Chinese: 光学动作捕捉系统因标记点遮挡导致性能下降,为此提出了模拟真实遮挡的CMU-Occlu数据集和通过标记点-关节链优化实现鲁棒捕捉的OpenMoCap模型,有效解决了遮挡问题。
English: Optical motion capture systems face performance degradation due to marker occlusions, which is addressed by the new CMU-Occlu dataset simulating realistic occlusions and the OpenMoCap model that robustly handles these challenges through marker-joint chain optimization.

Authors:Tan-Hanh Pham, Chris Ngo
Title: Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models
Abstract:
Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
Chinese: 本文提出多模态连续思维链(MCOUT)方法,通过在联合潜在空间而非自然语言中进行推理,实现了视觉与文本信息的动态对齐,在多模态任务中显著提升了准确率。
English: This paper introduces Multimodal Chain of Continuous Thought (MCOUT), a reasoning method that operates in a joint latent space rather than natural language, achieving significant accuracy improvements in multimodal tasks by dynamically aligning visual and textual information.
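
The continuous-thought loop can be caricatured with a tiny recurrent refinement module: a hidden "thought" vector is repeatedly updated from fused visual and textual embeddings instead of emitting word tokens. This toy module only mirrors the MCOUT-Base idea at a high level; the architecture, dimensions, and the way the real model reuses its last hidden state are simplified assumptions.

```python
import torch
import torch.nn as nn

class ContinuousThought(nn.Module):
    """Toy latent-space reasoning loop: refine a continuous 'thought' vector
    from concatenated visual and textual embeddings for a few steps."""
    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(2 * dim, dim)

    def forward(self, visual_emb: torch.Tensor, text_emb: torch.Tensor, n_steps: int = 4) -> torch.Tensor:
        # visual_emb, text_emb: (B, dim) pooled embeddings
        thought = torch.zeros(visual_emb.size(0), visual_emb.size(1), device=visual_emb.device)
        ctx = torch.cat([visual_emb, text_emb], dim=-1)   # (B, 2*dim)
        for _ in range(n_steps):
            thought = self.cell(ctx, thought)             # iterative latent refinement
        return thought                                     # passed to the answer head
```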

Authors:Hongsong Wang, Wanjiang Weng, Junbo Wang, Fang Zhao, Guo-Sen Xie, Xin Geng, Liang Wang
Title: Foundation Model for Skeleton-Based Human Action Understanding
Abstract:
Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks, and there is no skeleton foundation model that can be adapted to a wide range of such tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamic and spatial structure features. The MG-FD module collaboratively performs feature decorrelation across temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter effectively facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
中文摘要:本文提出了一种统一的基于骨架的密集表示学习(USDRL)框架,该框架通过创新的时空编码、特征解相关和多视角一致性训练模块,在25个基准测试中显著优于现有方法,为基于骨架的动作理解任务提供了基础模型支持。
English Summary: This paper introduces a Unified Skeleton-based Dense Representation Learning (USDRL) framework that serves as a foundational model for diverse skeleton-based action understanding tasks, significantly outperforming current methods across 25 benchmarks through its innovative modules for spatio-temporal encoding, feature decorrelation, and multi-perspective consistency training.

Authors:Jiayao Mai, Xiuyuan Lu, Kuan Dai, Shaojie Shen, Yi Zhou
Title: Temporal and Rotational Calibration for Event-Centric Multi-Sensor Systems
Abstract:
Event cameras generate asynchronous signals in response to pixel-level brightness changes, offering a sensing paradigm with theoretically microsecond-scale latency that can significantly enhance the performance of multi-sensor systems. Extrinsic calibration is a critical prerequisite for effective sensor fusion; however, the configuration that involves event cameras remains an understudied topic. In this paper, we propose a motion-based temporal and rotational calibration framework tailored for event-centric multi-sensor systems, eliminating the need for dedicated calibration targets. Our method uses as input the rotational motion estimates obtained from event cameras and other heterogeneous sensors, respectively. Different from conventional approaches that rely on event-to-frame conversion, our method efficiently estimates angular velocity from normal flow observations, which are derived from the spatio-temporal profile of event data. The overall calibration pipeline adopts a two-step approach: it first initializes the temporal offset and rotational extrinsics by exploiting kinematic correlations in the spirit of Canonical Correlation Analysis (CCA), and then refines both temporal and rotational parameters through a joint non-linear optimization using a continuous-time parametrization in SO(3). Extensive evaluations on both publicly available and self-collected datasets validate that the proposed method achieves calibration accuracy comparable to target-based methods, while exhibiting superior stability over purely CCA-based methods, and highlighting its precision, robustness and flexibility. To facilitate future research, our implementation will be made open-source. Code: https://github.com/NAIL-HNU/EvMultiCalib.
中文摘要:本文提出了一种基于运动的标定框架,通过利用事件相机与其他传感器的旋转运动数据,无需专用标定板即可实现时空参数联合优化,在公开和自采数据集上验证了其达到与传统标定方法相当的精度。
English Summary: This paper introduces a motion-based calibration framework for event camera multi-sensor systems that eliminates calibration targets by using rotational motion data and achieves accuracy comparable to target-based methods through a two-step optimization process.
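The temporal-offset initialization can be pictured with a simple correlation search: given angular-velocity estimates from two sensors resampled onto a common grid, scan candidate lags and keep the one that maximizes the normalized correlation of their magnitude profiles. This numpy sketch is a simplification of the kinematic-correlation idea, not the CCA-based pipeline or the SO(3) refinement described above.

```python
# Illustrative temporal-offset search via normalized cross-correlation of
# angular-velocity magnitudes (assumes pre-resampled, synchronized grids).
import numpy as np

def estimate_time_offset(omega_a, omega_b, dt, max_lag_s=0.5):
    """omega_a, omega_b: (N, 3) angular velocities sampled on a common grid."""
    a = np.linalg.norm(omega_a, axis=1)
    b = np.linalg.norm(omega_b, axis=1)
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    max_lag = int(max_lag_s / dt)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.mean(a[max(0, -l):len(a) - max(0, l)] *
                      b[max(0, l):len(b) - max(0, -l)]) for l in lags]
    return lags[int(np.argmax(scores))] * dt  # offset in seconds

# Synthetic check: sensor B lags sensor A by 40 ms (8 samples at 5 ms).
t = np.arange(0, 5, 0.005)
base = np.stack([np.sin(2 * t), np.cos(3 * t), np.sin(5 * t)], axis=1)
print(estimate_time_offset(base, np.roll(base, 8, axis=0), dt=0.005))
```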

Authors:Krishna Teja Chitty-Venkata, Murali Emani, Venkatram Vishwanath
Title: LangVision-LoRA-NAS: Neural Architecture Search for Variable LoRA Rank in Vision Language Models
Abstract:
Vision Language Models (VLMs) integrate visual and text modalities to enable multimodal understanding and generation. These models typically combine a Vision Transformer (ViT) as an image encoder and a Large Language Model (LLM) for text generation. LoRA (Low-Rank Adaptation) is an efficient fine-tuning method that adapts pre-trained models to new tasks by introducing low-rank updates to their weights; however, current implementations assume a fixed rank, potentially limiting flexibility and efficiency across diverse tasks. This paper introduces LangVision-LoRA-NAS, a novel framework that integrates Neural Architecture Search (NAS) with LoRA to optimize VLMs for variable-rank adaptation. Our approach leverages NAS to dynamically search for the optimal LoRA rank configuration tailored to specific multimodal tasks, balancing performance and computational efficiency. Through extensive experiments using the LLaMA-3.2-11B model on several datasets, LangVision-LoRA-NAS demonstrates notable improvement in model performance while reducing fine-tuning costs. Our base and searched fine-tuned models on LLaMA-3.2-11B-Vision-Instruct are available at https://huggingface.co/collections/krishnateja95/llama-32-11b-vision-instruct-langvision-lora-nas-6786cac480357a6a6fcc59ee, and the code for LangVision-LoRA-NAS is available at https://github.com/krishnateja95/LangVision-NAS.
中文: 本文提出的LangVision-LoRA-NAS框架将神经架构搜索与LoRA相结合,通过动态优化视觉语言模型的秩配置,在提升多模态任务性能的同时显著降低微调成本。
English: This paper introduces LangVision-LoRA-NAS, a framework that integrates Neural Architecture Search with LoRA to dynamically optimize Vision Language Models' rank configurations for enhanced performance and efficiency across multimodal tasks.
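A variable-rank LoRA layer is easy to picture: allocate the adapter at a maximum rank and let the search controller activate only the first r rows and columns when evaluating a candidate configuration. The sketch below is a hedged illustration of that idea; the class name, scaling, and the brute-force rank sweep are assumptions, not the LangVision-LoRA-NAS code.

```python
# Hedged sketch of a LoRA layer whose active rank can be varied at search time.
import torch
import torch.nn as nn

class VariableRankLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, max_rank=32, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, max_rank))
        self.alpha = alpha
        self.active_rank = max_rank                     # set by the search controller

    def forward(self, x):
        r = self.active_rank
        # Only the first r rows/columns of the adapter participate in the update.
        delta = (x @ self.lora_A[:r].T) @ self.lora_B[:, :r].T
        return self.base(x) + (self.alpha / max(r, 1)) * delta

# Evaluate a few candidate ranks on a toy batch (stand-in for the NAS search).
layer = VariableRankLoRALinear(64, 64)
x = torch.randn(8, 64)
for r in (4, 8, 16, 32):
    layer.active_rank = r
    print(r, layer(x).shape)
```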

Authors:Shayan Kebriti, Shahabedin Nabavi, Ali Gooya
Title: FractMorph: A Fractional Fourier-Based Multi-Domain Transformer for Deformable Image Registration
Abstract:
Deformable image registration (DIR) is a crucial and challenging technique for aligning anatomical structures in medical images and is widely applied in diverse clinical applications. However, existing approaches often struggle to capture fine-grained local deformations and large-scale global deformations simultaneously within a unified framework. We present FractMorph, a novel 3D dual-parallel transformer-based architecture that enhances cross-image feature matching through multi-domain fractional Fourier transform (FrFT) branches. Each Fractional Cross-Attention (FCA) block applies parallel FrFTs at fractional angles of $0^\circ$, $45^\circ$, $90^\circ$, along with a log-magnitude branch, to effectively extract local, semi-global, and global features at the same time. These features are fused via cross-attention between the fixed and moving image streams. A lightweight U-Net style network then predicts a dense deformation field from the transformer-enriched features. On the intra-patient ACDC cardiac MRI dataset, FractMorph achieves state-of-the-art performance with an overall Dice Similarity Coefficient (DSC) of $86.45\%$, an average per-structure DSC of $75.15\%$, and a 95th-percentile Hausdorff distance (HD95) of $1.54~\mathrm{mm}$ on our data split. FractMorph-Light, a lightweight variant of our model with only 29.6M parameters, preserves high accuracy while halving model complexity. Furthermore, we demonstrate the generality of our approach with solid performance on a cerebral atlas-to-patient dataset. Our results demonstrate that multi-domain spectral-spatial attention in transformers can robustly and efficiently model complex non-rigid deformations in medical images using a single end-to-end network, without the need for scenario-specific tuning or hierarchical multi-scale networks. The source code is available at https://github.com/shayankebriti/FractMorph.
中文:FractMorph提出了一种基于三维双并行Transformer的创新架构,通过多域分数阶傅里叶变换同时捕捉医学图像中的局部和全局形变,在心脏MRI和脑部数据集上仅用单一端到端网络就实现了最先进的配准性能。
English: FractMorph introduces a 3D dual-parallel transformer architecture using multi-domain fractional Fourier transforms to simultaneously capture local and global deformations in medical images, achieving state-of-the-art performance on cardiac MRI and cerebral datasets with a single end-to-end network.

Authors:Yaron Aloni, Rotem Shalev-Arkushin, Yonatan Shafir, Guy Tevet, Ohad Fried, Amit Haim Bermano
Title: Express4D: Expressive, Friendly, and Extensible 4D Facial Motion Generation Benchmark
Abstract:
Dynamic facial expression generation from natural language is a crucial task in Computer Graphics, with applications in Animation, Virtual Avatars, and Human-Computer Interaction. However, current generative models suffer from datasets that are either speech-driven or limited to coarse emotion labels, lacking the nuanced, expressive descriptions needed for fine-grained control, and were captured using elaborate and expensive equipment. We hence present a new dataset of facial motion sequences featuring nuanced performances and semantic annotation. The data is easily collected using commodity equipment and LLM-generated natural language instructions, in the popular ARKit blendshape format. This provides riggable motion, rich with expressive performances and labels. We accordingly train two baseline models, and evaluate their performance for future benchmarking. Using our Express4D dataset, the trained models can learn meaningful text-to-expression motion generation and capture the many-to-many mapping of the two modalities. The dataset, code, and video examples are available on our webpage: https://jaron1990.github.io/Express4D/

Authors:Ke Xing, Hanwen Liang, Dejia Xu, Yuyang Yin, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei
Title: TiP4GEN: Text to Immersive Panorama 4D Scene Generation
Abstract:
With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at https://ke-xing.github.io/TiP4GEN/.
中文: 本文提出TiP4GEN框架,通过双分支全景生成与几何对齐重建技术,实现了文本到动态全景场景的转换,能够生成具有精细控制和运动一致性的360度沉浸式虚拟环境。
English: This paper introduces TiP4GEN, a text-to-dynamic panorama generation framework that integrates dual-branch video synthesis and geometry-aligned reconstruction to create immersive 360-degree scenes with fine-grained control and motion coherence.

Authors:Jun Zeng, Yannan Huang, Elif Keles, Halil Ertugrul Aktas, Gorkem Durak, Nikhil Kumar Tomar, Quoc-Huy Trinh, Deepak Ranjan Nayak, Ulas Bagci, Debesh Jha
Title: SRMA-Mamba: Spatial Reverse Mamba Attention Network for Pathological Liver Segmentation in MRI Volumes
Abstract:
Liver Cirrhosis plays a critical role in the prognosis of chronic liver disease. Early detection and timely intervention are critical in significantly reducing mortality rates. However, the intricate anatomical architecture and diverse pathological changes of liver tissue complicate the accurate detection and characterization of lesions in clinical settings. Existing methods underutilize the spatial anatomical details in volumetric MRI data, thereby hindering their clinical effectiveness and explainability. To address this challenge, we introduce a novel Mamba-based network, SRMA-Mamba, designed to model the spatial relationships within the complex anatomical structures of MRI volumes. By integrating the Spatial Anatomy-Based Mamba module (SABMamba), SRMA-Mamba performs selective Mamba scans within liver cirrhotic tissues and combines anatomical information from the sagittal, coronal, and axial planes to construct a global spatial context representation, enabling efficient volumetric segmentation of pathological liver structures. Furthermore, we introduce the Spatial Reverse Attention module (SRMA), designed to progressively refine cirrhotic details in the segmentation map, utilizing both the coarse segmentation map and hierarchical encoding features. Extensive experiments demonstrate that SRMA-Mamba surpasses state-of-the-art methods, delivering exceptional performance in 3D pathological liver segmentation. Our code is publicly available at https://github.com/JunZengz/SRMA-Mamba.
中文摘要:肝硬化预后关键在于早期发现,而SRMA-Mamba网络通过整合MRI三维空间解剖信息,有效解决了临床诊断难题,实现了卓越的病理肝脏三维分割性能。
English Summary: Liver cirrhosis prognosis depends on early detection, and the proposed SRMA-Mamba network effectively addresses clinical challenges by integrating spatial anatomical details from MRI volumes for superior 3D pathological liver segmentation.

Authors:Liang Lv, Di Wang, Jing Zhang, Lefei Zhang
Title: S5: Scalable Semi-Supervised Semantic Segmentation in Remote Sensing
Abstract:
Semi-supervised semantic segmentation (S4) has advanced remote sensing (RS) analysis by leveraging unlabeled data through pseudo-labeling and consistency learning. However, existing S4 studies often rely on small-scale datasets and models, limiting their practical applicability. To address this, we propose S5, the first scalable framework for semi-supervised semantic segmentation in RS, which unlocks the potential of vast unlabeled Earth observation data typically underutilized due to costly pixel-level annotations. Built upon existing large-scale RS datasets, S5 introduces a data selection strategy that integrates entropy-based filtering and diversity expansion, resulting in the RS4P-1M dataset. Using this dataset, we systematically scale S4 methods by pre-training RS foundation models (RSFMs) of varying sizes on this extensive corpus, significantly boosting their performance on land cover segmentation and object detection tasks. Furthermore, during fine-tuning, we incorporate a Mixture-of-Experts (MoE)-based multi-dataset fine-tuning approach, which enables efficient adaptation to multiple RS benchmarks with fewer parameters. This approach improves the generalization and versatility of RSFMs across diverse RS benchmarks. The resulting RSFMs achieve state-of-the-art performance across all benchmarks, underscoring the viability of scaling semi-supervised learning for RS applications. All datasets, code, and models will be released at https://github.com/MiliLab/S5
中文: S5框架通过构建RS4P-1M数据集并采用大规模基础模型,提出可扩展的遥感半监督分割方法,结合数据选择和专家混合微调策略,在多个基准测试中实现了最先进的性能表现。
English: The S5 framework introduces a scalable semi-supervised approach for remote sensing by creating the RS4P-1M dataset and leveraging large-scale foundation models, achieving state-of-the-art performance across benchmarks through data selection and Mixture-of-Experts fine-tuning.

Authors:Hanwen Cao, Haobo Lu, Xiaosen Wang, Kun He
Title: ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers
Abstract:
Ensemble-based attacks have been proven to be effective in enhancing adversarial transferability by aggregating the outputs of models with various architectures. However, existing research primarily focuses on refining ensemble weights or optimizing the ensemble path, overlooking the exploration of the ensemble models themselves to enhance the transferability of adversarial attacks. To address this gap, we propose applying adversarial augmentation to the surrogate models, aiming to boost the overall generalization of ensemble models and reduce the risk of adversarial overfitting. Meanwhile, observing that ensemble attacks on Vision Transformers (ViTs) have received less attention, we propose ViT-EnsembleAttack based on the idea of model adversarial augmentation, which is, to the best of our knowledge, the first ensemble-based attack method tailored for ViTs. Our approach generates augmented models for each surrogate ViT using three strategies: Multi-head dropping, Attention score scaling, and MLP feature mixing, with the associated parameters optimized by Bayesian optimization. These adversarially augmented models are ensembled to generate adversarial examples. Furthermore, we introduce Automatic Reweighting and Step Size Enlargement modules to boost transferability. Extensive experiments demonstrate that ViT-EnsembleAttack significantly enhances the adversarial transferability of ensemble-based attacks on ViTs, outperforming existing methods by a substantial margin. Code is available at https://github.com/Trustworthy-AI-Group/TransferAttack.
中文: 提出的ViT-EnsembleAttack通过多头丢弃、注意力缩放和MLP混合等对抗增强策略,结合自动重加权和步长放大模块,显著提升了视觉Transformer的对抗迁移能力,大幅优于现有方法。
English: The proposed ViT-EnsembleAttack enhances adversarial transferability for Vision Transformers by applying adversarial augmentation through multi-head dropping, attention scaling, and MLP mixing, combined with automatic reweighting and step size enlargement, significantly outperforming existing methods.
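To show where model augmentation plugs into an ensemble attack, the sketch below averages the logits of several surrogate models (stand-ins for the augmented ViT copies) and takes a single FGSM step. Momentum, the Bayesian-optimized augmentation parameters, Automatic Reweighting, and Step Size Enlargement are intentionally omitted; this only illustrates the ensembling.

```python
# Minimal ensemble-attack sketch: one FGSM step against averaged surrogate logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_fgsm(models, images, labels, eps=8 / 255):
    adv = images.clone().detach().requires_grad_(True)
    logits = torch.stack([m(adv) for m in models]).mean(dim=0)  # ensemble logits
    loss = F.cross_entropy(logits, labels)
    grad, = torch.autograd.grad(loss, adv)
    adv = adv + eps * grad.sign()                               # untargeted step
    return adv.clamp(0, 1).detach()

# Toy surrogates standing in for adversarially augmented ViT copies.
surrogates = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) for _ in range(3)]
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
print(ensemble_fgsm(surrogates, x, y).shape)  # torch.Size([4, 3, 32, 32])
```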

Authors:Junyi Ma, Erhang Zhang, Yin-Dong Zheng, Yuchen Xie, Yixuan Zhou, Hesheng Wang
Title: EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos
Abstract:
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., ``how to interact''). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., ``when to interact'') is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.
中文: 本研究提出EgoLoc这一零样本方法,旨在精确定位第一人称视频中手与物体接触和分离的时间点,无需依赖物体掩码或预定义分类,有效提升了混合现实和机器人应用的沉浸式交互体验。
English: This research introduces EgoLoc, a zero-shot method for precisely localizing hand-object contact and separation timestamps in egocentric videos, enhancing immersive experiences in mixed reality and robotic applications without relying on object masks or predefined taxonomies.

Authors:Ziye Wang, Minghang Yu, Chunyan Xu, Zhen Cui
Title: Semantic Discrepancy-aware Detector for Image Forgery Identification
Abstract:
With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model's forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts' guidance. Finally, the low-level forgery feature enhancer integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at https://github.com/wzy1111111/SSD.
中文摘要:提出的语义差异感知检测器(SDD)通过重建学习和专门设计的模块,在细粒度视觉层面实现伪造与语义概念空间的对齐,从而显著提升伪造图像检测性能。
English Summary: The proposed Semantic Discrepancy-aware Detector (SDD) aligns forgery and semantic concept spaces through reconstruction learning and specialized modules to significantly improve fake image detection performance.

Authors:Xiaobin Deng, Changyu Diao, Min Li, Ruohan Yu, Duanqing Xu
Title: Improving Densification in 3D Gaussian Splatting for High-Fidelity Rendering
Abstract:
Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its densification strategy often results in suboptimal reconstruction quality. In this work, we present a comprehensive improvement to the densification pipeline of 3DGS from three perspectives: when to densify, how to densify, and how to mitigate overfitting. Specifically, we propose an Edge-Aware Score to effectively select candidate Gaussians for splitting. We further introduce a Long-Axis Split strategy that reduces geometric distortions introduced by clone and split operations. To address overfitting, we design a set of techniques, including Recovery-Aware Pruning, Multi-step Update, and Growth Control. Our method enhances rendering fidelity without introducing additional training or inference overhead, achieving state-of-the-art performance with fewer Gaussians.
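One way to picture an edge-aware selection signal is to score each Gaussian by the image-gradient magnitude at its projected 2D center and mark the top-scoring ones for splitting, as in the hedged sketch below; the paper's actual Edge-Aware Score, Long-Axis Split, and overfitting controls are not reproduced here.

```python
# Hedged sketch: rank Gaussians for splitting by an image-space edge signal.
import torch
import torch.nn.functional as F

def edge_magnitude(image):
    """image: (1, 1, H, W) grayscale render; returns (H, W) Sobel magnitude."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(image, kx, padding=1)
    gy = F.conv2d(image, ky, padding=1)
    return (gx.pow(2) + gy.pow(2)).sqrt()[0, 0]

def select_gaussians_to_split(image, centers_2d, top_frac=0.05):
    """centers_2d: (N, 2) integer pixel coordinates of projected Gaussian centers."""
    edges = edge_magnitude(image)
    scores = edges[centers_2d[:, 1], centers_2d[:, 0]]   # sample edge map at centers
    k = max(1, int(top_frac * len(scores)))
    return torch.topk(scores, k).indices                 # candidates for splitting

img = torch.rand(1, 1, 64, 64)
centers = torch.randint(0, 64, (1000, 2))
print(select_gaussians_to_split(img, centers).shape)  # torch.Size([50])
```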

Authors:Hongliang Wei, Xianqi Zhang, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Title: Region-Level Context-Aware Multimodal Understanding
Abstract:
Despite significant progress, existing research on Multimodal Large Language Models (MLLMs) mainly focuses on general visual understanding, overlooking the ability to integrate textual context associated with objects for a more context-aware multimodal understanding -- an ability we refer to as Region-level Context-aware Multimodal Understanding (RCMU). To address this limitation, we first formulate the RCMU task, which requires models to respond to user instructions by integrating both image content and textual information of regions or objects. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT), which incorporates object information into the model input and enables the model to utilize bounding box coordinates to effectively associate objects' visual content with their textual information. To address the lack of datasets, we introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC&P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks. Additionally, we propose a reference-free evaluation metric to perform a comprehensive and fine-grained evaluation of the region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL models with the RCMU dataset, we developed RC-Qwen2-VL models. Experimental results indicate that RC-Qwen2-VL models not only achieve outstanding performance on multiple RCMU tasks but also demonstrate successful applications in multimodal RAG and personalized conversation. Our data, model and benchmark are available at https://github.com/hongliang-wei/RC-MLLM
中文: 本研究提出了区域级上下文感知多模态理解(RCMU),通过结合对象文本信息与视觉内容,开发了RCVIT方法和相关数据集及基准,实验表明RC-Qwen2-VL模型在RCMU任务和实际应用中表现卓越。
English: This research introduces Region-level Context-aware Multimodal Understanding (RCMU) to enhance MLLMs by integrating object-specific textual context with visual data, proposing the RCVIT method and a new dataset and benchmark, with the resulting RC-Qwen2-VL models showing superior performance in RCMU tasks and practical applications.

Authors:Quan Chen, Xiong Yang, Rongfeng Lu, Qianyu Zhang, Yu Liu, Xiaofei Zhou, Bolun Zheng
Title: WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
Abstract:
Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of datasets with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods show that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD
中文: 本文提出WXSOD数据集以解决显著目标检测中的天气噪声问题,并开发了WFANet模型,通过融合天气特征实现了优越性能。
English: This paper introduces the WXSOD dataset to address weather noise in salient object detection and proposes the WFANet model, which effectively integrates weather features to achieve superior performance.

Authors:Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
Title: Splat Feature Solver
Abstract:
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high-quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at https://github.com/saliteta/splat-distiller.git, and a project page is available at https://splat-distiller.pages.dev/.
Chinese: 本文提出了一种统一的特征提升方法,将问题构建为稀疏线性逆问题,通过高效的闭式解和正则化策略,在三维分割任务中实现了最先进的性能。
English: This paper introduces a unified feature lifting method that formulates the problem as a sparse linear inverse problem, achieving state-of-the-art performance in 3D segmentation through efficient closed-form solutions and regularization strategies.
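The closed-form view of feature lifting can be illustrated with a dense ridge-regularized least-squares solve that assigns one feature vector per primitive from per-pixel features and blending weights. The real solver operates on sparse systems and uses the paper's specific Tikhonov Guidance and Post-Lifting Aggregation; this sketch only shows the underlying linear-inverse step.

```python
# Dense sketch of closed-form feature lifting as a regularized least-squares solve.
import torch

def lift_features(weights, pixel_features, lam=1e-3):
    """weights: (P, G) blending weights of G primitives over P pixels.
    pixel_features: (P, D) image features (e.g., CLIP/DINO) per pixel.
    Returns (G, D) lifted per-primitive features."""
    gram = weights.T @ weights                           # (G, G) normal equations
    reg = lam * torch.eye(gram.shape[0])                 # simple Tikhonov term
    return torch.linalg.solve(gram + reg, weights.T @ pixel_features)

W = torch.rand(500, 40)            # toy blending weights
Y = torch.randn(500, 16)           # toy per-pixel features
print(lift_features(W, Y).shape)   # torch.Size([40, 16])
```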

Authors:Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, Ondřej Chum
Title: Infusing fine-grained visual knowledge to Vision-Language Models
Abstract:
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model's general-purpose visual and cross-modal capabilities. In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed at knowledge retention and propose an efficient and effective combination strategy. Additionally, we address the commonly overlooked yet critical aspects of validation set design and hyperparameter tuning to ensure reproducibility and robust generalization across datasets and pretrained models. We extensively evaluate our method on both fine-grained and coarse-grained image-image and image-text retrieval benchmarks. Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning. Code and model checkpoints: https://github.com/nikosips/infusing .
Chinese: 本研究提出了一种微调方法,在保持预训练视觉语言模型广泛多模态能力的同时实现领域特定适应,无需在微调过程中使用文本数据即可在细粒度和粗粒度检索任务中取得优异性能。
English: This work introduces a fine-tuning method that balances domain-specific adaptation with preserving the broad multimodal capabilities of pretrained Vision-and-Language Models, achieving strong performance in both fine-grained and coarse-grained retrieval tasks without using text data during fine-tuning.

Authors:Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee
Title: Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
Abstract:
Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
中文: AutoEval框架提出预测一致性与可靠性(PCR)方法,通过分析边界框的空间一致性和置信度可靠性,无需真实标注即可自动评估目标检测性能,经多样化元数据集验证,其评估准确性优于现有方法。
English: The AutoEval framework introduces Prediction Consistency and Reliability (PCR) to automatically estimate object detection performance without ground-truth labels by analyzing spatial consistency and confidence reliability of bounding boxes, validated through a diverse meta-dataset showing superior accuracy over existing methods.
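The PCR idea can be sketched as follows: for each box kept by NMS, measure how well the pre-NMS candidates that overlap it agree spatially (mean IoU) and how confident those candidates are, then combine the two. The exact scoring and aggregation in the paper may differ; this is a simplified illustration.

```python
# Simplified PCR-style score from pre-NMS candidates and post-NMS kept boxes.
import torch
from torchvision.ops import box_iou

def pcr_score(pre_nms_boxes, pre_nms_scores, kept_boxes, iou_thr=0.5):
    """pre_nms_boxes: (M, 4), pre_nms_scores: (M,), kept_boxes: (K, 4), xyxy format."""
    iou = box_iou(kept_boxes, pre_nms_boxes)              # (K, M)
    scores = []
    for k in range(kept_boxes.shape[0]):
        overlap = iou[k] > iou_thr
        if overlap.sum() == 0:
            continue
        consistency = iou[k][overlap].mean()              # spatial agreement
        reliability = pre_nms_scores[overlap].mean()      # confidence of supporters
        scores.append(consistency * reliability)
    return torch.stack(scores).mean() if scores else torch.tensor(0.0)

pre_boxes = torch.tensor([[10., 10., 50., 50.], [12., 9., 52., 49.], [200., 200., 240., 240.]])
pre_conf = torch.tensor([0.9, 0.8, 0.3])
kept = torch.tensor([[11., 10., 51., 50.]])
print(pcr_score(pre_boxes, pre_conf, kept))
```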

Authors:Durgesh Kumar Singh, Qing Cao, Sarina Thomas, Ahcène Boubekki, Robert Jenssen, Michael Kampffmeyer
Title: WiseLVAM: A Novel Framework For Left Ventricle Automatic Measurements
Abstract:
Clinical guidelines recommend performing left ventricular (LV) linear measurements in B-mode echocardiographic images at the basal level -- typically at the mitral valve leaflet tips -- and aligned perpendicular to the LV long axis along a virtual scanline (SL). However, most automated methods estimate landmarks directly from B-mode images for the measurement task, where even small shifts in predicted points along the LV walls can lead to significant measurement errors, reducing their clinical reliability. A recent semi-automatic method, EnLVAM, addresses this limitation by constraining landmark prediction to a clinician-defined SL and training on generated Anatomical Motion Mode (AMM) images to predict LV landmarks along the same. To enable full automation, a contour-aware SL placement approach is proposed in this work, in which the LV contour is estimated using a weakly supervised B-mode landmark detector. SL placement is then performed by inferring the LV long axis and the basal level, mimicking clinical guidelines. Building on this foundation, we introduce WiseLVAM, a novel, fully automated yet manually adaptable framework for automatically placing the SL and then performing the LV linear measurements in AMM mode. WiseLVAM utilizes the structure-awareness from B-mode images and the motion-awareness from AMM mode to enhance robustness and accuracy, offering a practical solution for routine clinical application. The source code is publicly available at https://github.com/SFI-Visual-Intelligence/wiselvam.git.
中文: 本文提出WiseLVAM全自动框架,通过结合轮廓感知扫描线定位与解剖运动模式成像技术,提升左心室线性测量的精确度和临床实用性。
English: This paper introduces WiseLVAM, a fully automated framework that enhances left ventricular linear measurements by combining contour-aware scanline placement with Anatomical Motion Mode imaging to improve accuracy and clinical reliability.

Authors:Yuanbin Fu, Liang Li, Xiaojie Guo
Title: PEdger++: Practical Edge Detection via Assembling Cross Information
Abstract:
Edge detection serves as a critical foundation for numerous computer vision applications, including object detection, semantic segmentation, and image editing, by extracting essential structural cues that define object boundaries and salient edges. To be viable for broad deployment across devices with varying computational capacities, edge detectors must balance high accuracy with low computational complexity. While deep learning has evidently improved accuracy, deep models often suffer from high computational costs, limiting their applicability on resource-constrained devices. This paper addresses the challenge of achieving that balance, i.e., how to efficiently capture discriminative features without relying on large and sophisticated models. We propose PEdger++, a collaborative learning framework designed to reduce computational costs and model sizes while improving edge detection accuracy. The core principle of our PEdger++ is that cross-information derived from heterogeneous architectures, diverse training moments, and multiple parameter samplings is beneficial to enhance learning from an ensemble perspective. Extensive experimental results on the BSDS500, NYUD and Multicue datasets demonstrate the effectiveness of our approach, both quantitatively and qualitatively, showing clear improvements over existing methods. We also provide multiple versions of the model with varying computational requirements, highlighting PEdger++'s adaptability with respect to different resource constraints. Codes are accessible at https://github.com/ForawardStar/EdgeDetectionviaPEdgerPlus/.
中文: PEdger++ 提出了一种协作学习框架,通过融合异构架构、不同训练时刻和多重参数采样的跨信息,在降低计算成本和模型大小的同时提高了边缘检测精度,并在多个数据集上展现出优于现有方法的性能。
English: PEdger++ introduces a collaborative learning framework that enhances edge detection accuracy while reducing computational costs and model size by leveraging cross-information from heterogeneous architectures, training moments, and parameter samplings, demonstrating superior performance across multiple datasets.

Authors:Seunghun Lee, Jiwan Seo, Jeonghoon Kim, Siwon Kim, Haeun Yun, Hyogyeong Jeon, Wonhyeok Choi, Jaehoon Jeong, Zane Durante, Sang Hyun Park, Sunghoon Im
Title: SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training -- regardless of their actual relevance to the expression. To address this, we introduce a moment-aware RVOS framework named SAMDWICH, along with a newly annotated dataset, MeViS-M, built upon the challenging MeViS benchmark. We manually annotate temporal moments indicating when each object is referred to by the expression, enabling semantically grounded supervision that strengthens video-text alignment. SAMDWICH leverages these aligned text-to-clip pairs to guide training, significantly enhancing referential understanding. Building upon this framework, we propose Moment-guided Dual-path Propagation (MDP), a moment-aware propagation strategy that improves both object grounding and tracking by training on both relevant and irrelevant frames through a moment-centric memory mechanism. In addition, we introduce Object-level Selective Supervision (OSS), an object-level filtering strategy that supervises only the objects temporally aligned with the expression in each training clip. This selective supervision reduces semantic noise and reinforces language-conditioned learning. Extensive experiments show that SAMDWICH achieves state-of-the-art performance on challenging MeViS benchmark, particularly excelling in complex scenarios involving diverse expressions.

Authors:Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, Tristan Berchoux
Title: TimeSenCLIP: A Vision-Language Model for Remote Sensing Using Single-Pixel Time Series
Abstract:
Vision-language models have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face two key challenges: reliance on large spatial tiles that increase computational cost, and dependence on text-based supervision, which is often not readily available. In this work, we present TimeSenCLIP, a lightweight framework that reevaluates the role of spatial context by testing how effective a single pixel, leveraging its temporal and spectral dimensions, can be for classifying LULC and ecosystem types. By leveraging spectral and temporal information from Sentinel-2 imagery and cross-view learning with geo-tagged ground-level photos, we minimise the need for caption-based training while preserving semantic alignment between overhead (satellite) and ground perspectives. Our approach is grounded in the LUCAS and Sen4Map datasets, and evaluated on classification tasks including LULC, crop type, and ecosystem type. We demonstrate that single-pixel inputs, when combined with temporal and spectral cues, are sufficient for thematic mapping, offering a scalable and efficient alternative for large-scale remote sensing applications. Code is available at https://github.com/pallavijain-pj/TimeSenCLIP
Chinese: TimeSenCLIP 提出了一种轻量级框架,利用 Sentinel-2 影像中单个像素的时空和光谱数据,结合地面照片的跨视角学习,有效分类土地利用和土地覆盖类型,降低了对文本监督和计算成本的依赖。
English: TimeSenCLIP introduces a lightweight framework that uses single pixels' temporal and spectral data from Sentinel-2 imagery and cross-view learning with ground photos to efficiently classify land-use and land-cover types, reducing reliance on text supervision and computational costs.
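The cross-view setup can be sketched as a small temporal encoder over a single pixel's Sentinel-2 time series, aligned with ground-photo embeddings via a symmetric InfoNCE (CLIP-style) loss. Band counts, encoder shape, and the projection below are illustrative assumptions, not the released architecture.

```python
# Sketch of single-pixel time-series encoding plus a CLIP-style cross-view loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelTimeSeriesEncoder(nn.Module):
    def __init__(self, n_bands=10, d_model=128, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_bands, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                       # x: (B, T, n_bands) per-pixel series
        h = self.encoder(self.proj(x))
        return F.normalize(h.mean(dim=1), dim=-1)

def clip_loss(pixel_emb, photo_emb, temperature=0.07):
    logits = pixel_emb @ photo_emb.T / temperature
    targets = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

enc = PixelTimeSeriesEncoder()
series = torch.randn(8, 12, 10)                     # 12 monthly observations, 10 bands
photos = F.normalize(torch.randn(8, 128), dim=-1)   # stand-in ground-photo embeddings
print(clip_loss(enc(series), photos))
```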

Authors:Runhao Zeng, Jiaqi Mao, Minghao Lai, Minh Hieu Phan, Yanjie Dong, Wei Wang, Qi Chen, Xiping Hu
Title: OVG-HQ: Online Video Grounding with Hybrid-modal Queries
Abstract:
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with some scenarios like streaming video or queries using visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retains previously learned knowledge to enhance current decisions and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. Besides, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at https://github.com/maojiaqi2324/OVG-HQ.
中文: 本研究提出了在线视频定位与混合模态查询(OVG-HQ)新任务,通过文本、图像和视频片段等多种查询方式定位视频片段,并开发了OVG-HQ-Unify统一框架,结合参数记忆块和跨模态蒸馏策略解决模态不平衡和上下文限制问题,新构建的数据集和在线评估指标验证了其优越性能。
English: The study introduces Online Video Grounding with Hybrid-modal Queries (OVG-HQ), a new task for locating video segments using diverse queries, and proposes OVG-HQ-Unify, a unified framework with a Parametric Memory Block and cross-modal distillation to handle modality imbalance and limited context, validated by a new dataset and online metrics showing superior performance.

Authors:Quanwei Hu, Yinggan Tang, Xuguang Zhang
Title: Large Kernel Modulation Network for Efficient Image Super-Resolution
Abstract:
Image super-resolution (SR) in resource-constrained scenarios demands lightweight models balancing performance and latency. Convolutional neural networks (CNNs) offer low latency but lack non-local feature capture, while Transformers excel at non-local modeling yet suffer slow inference. To address this trade-off, we propose the Large Kernel Modulation Network (LKMN), a pure CNN-based model. LKMN has two core components: Enhanced Partial Large Kernel Block (EPLKB) and Cross-Gate Feed-Forward Network (CGFN). The EPLKB utilizes channel shuffle to boost inter-channel interaction, incorporates channel attention to focus on key information, and applies large kernel strip convolutions on partial channels for non-local feature extraction with reduced complexity. The CGFN dynamically adjusts discrepancies between input, local, and non-local features via a learnable scaling factor, then employs a cross-gate strategy to modulate and fuse these features, enhancing their complementarity. Extensive experiments demonstrate that our method outperforms existing state-of-the-art (SOTA) lightweight SR models while balancing quality and efficiency. Specifically, LKMN-L achieves a 0.23 dB PSNR improvement over DAT-light on the Manga109 dataset at ×4 upscaling while running nearly 4.8 times faster. The code is available at https://github.com/Supereeeee/LKMN.
Chinese: 大核调制网络(LKMN)是一种基于CNN的轻量级模型,通过整合大核卷积进行非局部特征提取和交叉门控机制实现特征融合,在图像超分辨率任务中有效平衡了质量与效率,以更快推理速度超越了现有最优方法。
English: The Large Kernel Modulation Network (LKMN) is a lightweight CNN-based model that effectively balances image super-resolution quality and efficiency by integrating large kernel convolutions for non-local feature extraction and a cross-gate mechanism for feature fusion, outperforming state-of-the-art methods with faster inference.
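The EPLKB design can be approximated as: shuffle channels, weight them with channel attention, then run large 1xk and kx1 strip convolutions on only a fraction of the channels. The sketch below is a hedged approximation; kernel size, split ratio, and layer ordering are assumptions rather than the released block.

```python
# Hedged sketch of a partial large-kernel strip-convolution block (EPLKB-like).
import torch
import torch.nn as nn

class PartialLargeKernelBlock(nn.Module):
    def __init__(self, channels=48, kernel=31, partial_ratio=0.25):
        super().__init__()
        self.part = int(channels * partial_ratio)
        pad = kernel // 2
        self.strip_h = nn.Conv2d(self.part, self.part, (1, kernel), padding=(0, pad), groups=self.part)
        self.strip_v = nn.Conv2d(self.part, self.part, (kernel, 1), padding=(pad, 0), groups=self.part)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)  # channel shuffle
        x = x * self.ca(x)                                                   # channel attention
        head, tail = x[:, :self.part], x[:, self.part:]
        head = self.strip_v(self.strip_h(head))                              # large strip convs on part of the channels
        return torch.cat([head, tail], dim=1)

block = PartialLargeKernelBlock()
print(block(torch.randn(1, 48, 32, 32)).shape)  # torch.Size([1, 48, 32, 32])
```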

Authors:Ming Cheng, Tong Wu, Jiazhen Hu, Jiaying Gong, Hoda Eldardiry
Title: VideoAVE: A Multi-Attribute Video-to-Text Attribute Value Extraction Dataset and Benchmark Models
Abstract:
Attribute Value Extraction (AVE) is important for structuring product information in e-commerce. However, existing AVE datasets are primarily limited to text-to-text or image-to-text settings, lacking support for product videos, diverse attribute coverage, and public availability. To address these gaps, we introduce VideoAVE, the first publicly available video-to-text e-commerce AVE dataset across 14 different domains and covering 172 unique attributes. To ensure data quality, we propose a post-hoc CLIP-based Mixture of Experts filtering system (CLIP-MoE) to remove mismatched video-product pairs, resulting in a refined dataset of 224k training samples and 25k evaluation samples. In order to evaluate the usability of the dataset, we further establish a comprehensive benchmark by evaluating several state-of-the-art video vision language models (VLMs) under both attribute-conditioned value prediction and open attribute-value pair extraction tasks. Our analysis of the results reveals that video-to-text AVE remains a challenging problem, particularly in open settings, and there is still room for developing more advanced VLMs capable of leveraging effective temporal information. The dataset and benchmark code for VideoAVE are available at: https://github.com/gjiaying/VideoAVE
中文: 本研究推出了首个公开的视频到文本电商属性值提取数据集VideoAVE,涵盖14个领域和172种属性,填补了现有资源的空白,并通过基准测试表明视频到文本的属性值提取仍具挑战性,尤其是在开放设置中。
English: The study introduces VideoAVE, the first publicly available video-to-text dataset for Attribute Value Extraction in e-commerce, addressing gaps in existing resources by covering 14 domains and 172 attributes, and establishes a benchmark showing the challenge of video-to-text AVE, especially in open settings.
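The filtering step can be pictured as a similarity gate: given precomputed CLIP embeddings of sampled frames and of the product text, keep a video-product pair only if their mean cosine similarity clears a threshold. The mixture-of-experts routing of CLIP-MoE is not shown; this is a simplified stand-in with an assumed threshold.

```python
# Simplified similarity-based filtering of video-product pairs (CLIP embeddings assumed precomputed).
import torch
import torch.nn.functional as F

def keep_pair(frame_embs, text_emb, threshold=0.25):
    """frame_embs: (F, D) CLIP image embeddings of sampled frames;
    text_emb: (D,) CLIP text embedding of the product description."""
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)
    return sims.mean().item() >= threshold

frames = F.normalize(torch.randn(8, 512), dim=-1)   # stand-in frame embeddings
text = F.normalize(torch.randn(512), dim=-1)        # stand-in product-text embedding
print(keep_pair(frames, text))
```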

Authors:Haojie Zhang, Yixiong Liang, Hulin Kuang, Lihui Cen, Zhe Qu, Yigang Cen, Min Zeng, Shichao Kan
Title: Contrastive Regularization over LoRA for Multimodal Biomedical Image Incremental Learning
Abstract:
Multimodal Biomedical Image Incremental Learning (MBIIL) is essential for handling diverse tasks and modalities in the biomedical domain, as training separate models for each modality or task significantly increases inference costs. Existing incremental learning methods focus on task expansion within a single modality, whereas MBIIL seeks to train a unified model incrementally across modalities. The MBIIL faces two challenges: I) How to preserve previously learned knowledge during incremental updates? II) How to effectively leverage knowledge acquired from existing modalities to support new modalities? To address these challenges, we propose MSLoRA-CR, a method that fine-tunes Modality-Specific LoRA modules while incorporating Contrastive Regularization to enhance intra-modality knowledge sharing and promote inter-modality knowledge differentiation. Our approach builds upon a large vision-language model (LVLM), keeping the pretrained model frozen while incrementally adapting new LoRA modules for each modality or task. Experiments on the incremental learning of biomedical images demonstrate that MSLoRA-CR outperforms both the state-of-the-art (SOTA) approach of training separate models for each modality and the general incremental learning method (incrementally fine-tuning LoRA). Specifically, MSLoRA-CR achieves a 1.88% improvement in overall performance compared to unconstrained incremental learning methods while maintaining computational efficiency. Our code is publicly available at https://github.com/VentusAislant/MSLoRA_CR.
中文摘要:MSLoRA-CR是一种新颖的多模态生物医学图像增量学习方法,通过对比正则化微调模态特定的LoRA模块,在保持计算效率的同时实现跨模态知识共享,性能比现有方法提升1.88%。
English Summary: MSLoRA-CR is a novel multimodal biomedical image incremental learning method that fine-tunes modality-specific LoRA modules with contrastive regularization to enable knowledge sharing across modalities while maintaining computational efficiency, outperforming existing approaches by 1.88%.
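One plausible reading of the contrastive regularization is to treat each LoRA module's flattened update as an embedding, pulling modules of the same modality together and pushing different modalities apart. The sketch below is a speculative simplification; the actual loss and where it attaches in the LVLM may differ from this illustration.

```python
# Speculative sketch of a contrastive regularizer over modality-specific LoRA modules.
import torch
import torch.nn.functional as F

def lora_embedding(lora_A, lora_B):
    # Flattened low-rank update B @ A, normalized to unit length.
    return F.normalize((lora_B @ lora_A).flatten(), dim=0)

def contrastive_regularizer(embeddings, modality_ids, temperature=0.1):
    """embeddings: (N, D) LoRA embeddings; modality_ids: (N,) modality labels."""
    sim = embeddings @ embeddings.T / temperature
    sim.fill_diagonal_(-1e9)                              # ignore self-similarity
    same = modality_ids.unsqueeze(0) == modality_ids.unsqueeze(1)
    same.fill_diagonal_(False)                            # exclude self as positive
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = same.sum(dim=1).clamp(min=1)
    return -(log_prob * same.float()).sum(dim=1).div(pos_count).mean()

# Toy LoRA modules from two modalities (rank 4, 16-dim weights).
embs = torch.stack([lora_embedding(torch.randn(4, 16), torch.randn(16, 4)) for _ in range(6)])
print(contrastive_regularizer(embs, torch.tensor([0, 0, 0, 1, 1, 1])))
```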

Authors:Yang Zhao, Tao Wang, Said Elhadi
Title: Data-driven RF Tomography via Cross-modal Sensing and Continual Learning
Abstract:
Data-driven radio frequency (RF) tomography has demonstrated significant potential for underground target detection, due to the penetrative nature of RF signals through soil. However, it is still challenging to achieve accurate and robust performance in dynamic environments. In this work, we propose a data-driven radio frequency tomography (DRIFT) framework with the following key components to reconstruct cross-section images of underground root tubers, even with significant changes in RF signals. First, we design a cross-modal sensing system with RF and visual sensors, and propose to train an RF tomography deep neural network (DNN) model following the cross-modal learning approach. Then we propose to apply continual learning to automatically update the DNN model once environment changes are detected in a dynamic environment. Experimental results show that our approach achieves an average equivalent diameter error of 2.29 cm, a 23.2% improvement upon the state-of-the-art approach. Our DRIFT code and dataset are publicly available on https://github.com/Data-driven-RTI/DRIFT.
中文: 提出的DRIFT框架结合跨模态学习和持续学习,提升了数据驱动射频层析成像技术,在地下根茎成像中即使环境变化仍实现精度提升23.2%。
English: The proposed DRIFT framework utilizes cross-modal learning and continual learning to enhance data-driven RF tomography, achieving a 23.2% improvement in accuracy for underground root tuber imaging despite environmental changes.

Authors:Shilei Wang, Gong Cheng, Pujian Lai, Dong Gao, Junwei Han
Title: Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction
Abstract:
Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often comes at the expense of weakened feature representation capacity, limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose the Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at https://github.com/wsumel/MST.
中文摘要:多状态跟踪器(MST)通过轻量级状态增强模块和跨状态交互机制,在极低计算开销下大幅提升特征表征能力,在多个数据集上实现了跟踪精度与运行效率的突破性提升。
English Summary: The Multi-State Tracker (MST) introduces lightweight state-specific enhancement and cross-state interaction modules to significantly boost feature representation with minimal computational cost, achieving superior tracking accuracy and runtime efficiency across multiple datasets.

Authors:Qian Liang, Zichong Chen, Yang Zhou, Hui Huang
Title: SPG: Style-Prompting Guidance for Style-Specific Content Creation
Abstract:
Although recent text-to-image (T2I) diffusion models excel at aligning generated images with textual prompts, controlling the visual style of the output remains a challenging task. In this work, we propose Style-Prompting Guidance (SPG), a novel sampling strategy for style-specific image generation. SPG constructs a style noise vector and leverages its directional deviation from unconditional noise to guide the diffusion process toward the target style distribution. By integrating SPG with Classifier-Free Guidance (CFG), our method achieves both semantic fidelity and style consistency. SPG is simple, robust, and compatible with controllable frameworks like ControlNet and IPAdapter, making it practical and widely applicable. Extensive experiments demonstrate the effectiveness and generality of our approach compared to state-of-the-art methods. Code is available at https://github.com/Rumbling281441/SPG.
中文: 本文提出风格提示引导(SPG)方法,通过构建风格噪声向量并利用其与无条件噪声的方向偏差来引导扩散过程,在保持语义准确性的同时实现视觉风格一致性,展现出卓越性能并与现有控制框架广泛兼容。
English: This paper introduces Style-Prompting Guidance (SPG), a novel sampling strategy that enhances text-to-image diffusion models by ensuring both semantic accuracy and consistent visual style through style noise vector guidance, demonstrating superior performance and broad compatibility with existing frameworks.
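The sampling rule can be summarized in one line per denoising step: start from the unconditional noise prediction and add a CFG term along the text direction plus an SPG term along the style direction. The guidance weights and the way the style noise prediction is obtained below are assumptions; see the released code for the exact formulation.

```python
# Illustrative combination of Classifier-Free Guidance and a style-guidance term.
import torch

def guided_noise(eps_uncond, eps_text, eps_style, w_cfg=7.5, w_spg=3.0):
    """All inputs are the model's noise predictions for the same latent and timestep."""
    return (eps_uncond
            + w_cfg * (eps_text - eps_uncond)     # classifier-free guidance (text direction)
            + w_spg * (eps_style - eps_uncond))   # style-prompting guidance (style direction)

latent_shape = (1, 4, 64, 64)
eps_u, eps_t, eps_s = (torch.randn(latent_shape) for _ in range(3))
print(guided_noise(eps_u, eps_t, eps_s).shape)  # torch.Size([1, 4, 64, 64])
```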

Authors:Hongjin Fang, Daniel Reisenbüchler, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Title: CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation
Abstract:
Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline's speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: https://github.com/ddrrnn123/CoFi.
中文: 本研究提出的CoFi是一种从粗到精的少样本分割流程,仅需少量标注即可高效分割电子显微镜图像中的肾小球基底膜,兼具高精度与快速处理能力,适用于临床应用。
English: The study introduces CoFi, a coarse-to-fine few-shot segmentation pipeline that efficiently segments the glomerular basement membrane in electron microscopy images using minimal annotations and achieves high accuracy and speed, making it suitable for clinical applications.
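The prompt-generation step can be approximated by eroding the coarse mask so that sampled points stay well inside the structure and then subsampling foreground coordinates as positive point prompts for SAM. The paper's morphology-aware pruning and the SAM refinement call itself are not reproduced in this sketch.

```python
# Hedged sketch: derive point prompts from a coarse mask via erosion and subsampling.
import numpy as np
from scipy.ndimage import binary_erosion

def point_prompts_from_mask(coarse_mask, n_points=10, erosion_iters=2, seed=0):
    """coarse_mask: (H, W) boolean array from the lightweight coarse model."""
    core = binary_erosion(coarse_mask, iterations=erosion_iters)
    ys, xs = np.nonzero(core if core.any() else coarse_mask)   # fall back if erosion empties the mask
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(ys), size=min(n_points, len(ys)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)                # (n, 2) in (x, y) order

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:50] = True                                      # toy coarse region
print(point_prompts_from_mask(mask).shape)                     # (10, 2)
```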

Authors:Augustine X. W. Lee, Pak-Hei Yeung, Jagath C. Rajapakse
Title: Subcortical Masks Generation in CT Images via Ensemble-Based Cross-Domain Label Transfer
Abstract:
Subcortical segmentation in neuroimages plays an important role in understanding brain anatomy and facilitating computer-aided diagnosis of traumatic brain injuries and neurodegenerative disorders. However, training accurate automatic models requires large amounts of labelled data. Despite the availability of public subcortical segmentation datasets for Magnetic Resonance Imaging (MRI), a significant gap exists for Computed Tomography (CT). This paper proposes an automatic ensemble framework to generate high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models. We introduce a robust ensembling pipeline to integrate these models and apply it to unannotated paired MRI-CT data, resulting in a comprehensive CT subcortical segmentation dataset. Extensive experiments on multiple public datasets demonstrate the superior performance of our proposed framework. Furthermore, using our generated CT dataset, we train segmentation models that achieve improved performance on related segmentation tasks. To facilitate future research, we make our source code, generated dataset, and trained models publicly available at https://github.com/SCSE-Biomedical-Computing-Group/CT-Subcortical-Segmentation, marking the first open-source release for CT subcortical segmentation to the best of our knowledge.
中文: 本文提出一种自动集成框架,通过利用现有基于磁共振成像的模型为计算机断层扫描生成高质量皮层下分割标签,并首次开源相关数据集与模型以推动该领域研究。
English: This paper introduces an automatic ensemble framework that generates high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models, creating the first open-source dataset and models for CT subcortical segmentation to advance related research.
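
A per-voxel majority vote is the simplest way to combine label maps from several MRI-trained models on registered MRI-CT pairs; the paper's ensembling pipeline may weight or post-process the votes differently, so treat this as the basic idea only.

```python
# Illustrative majority vote over subcortical label volumes from multiple models.
import numpy as np

def majority_vote(label_maps):
    """label_maps: list of integer label volumes with identical shape."""
    stack = np.stack(label_maps, axis=0)                  # (M, D, H, W)
    num_labels = int(stack.max()) + 1
    votes = np.zeros((num_labels,) + stack.shape[1:], dtype=np.int32)
    for label in range(num_labels):
        votes[label] = (stack == label).sum(axis=0)       # count votes per voxel
    return votes.argmax(axis=0)                           # winning label per voxel
```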

Authors:Yinggan Tang, Quanwei Hu
Title: LKFMixer: Exploring Large Kernel Feature For Efficient Image Super-Resolution
Abstract:
The success of self-attention (SA) in Transformers demonstrates the importance of non-local information to image super-resolution (SR), but the huge computing power required makes it difficult to implement lightweight models. To solve this problem, we propose a pure convolutional neural network (CNN) model, LKFMixer, which utilizes large convolutional kernels to simulate the ability of self-attention to capture non-local features. Specifically, we increase the kernel size to 31 to obtain as large a receptive field as possible, and reduce the parameters and computations by coordinate decomposition. Meanwhile, a spatial feature modulation block (SFMB) is designed to enhance the focus of feature information on both the spatial and channel dimensions. In addition, by introducing a feature selection block (FSB), the model can adaptively adjust the weights between local features and non-local features. Extensive experiments show that the proposed LKFMixer family outperforms other state-of-the-art (SOTA) methods in terms of SR performance and reconstruction quality. In particular, compared with SwinIR-light on the Manga109 dataset, LKFMixer-L achieves a 0.6dB PSNR improvement at $\times$4 scale, while inference is 5$\times$ faster. The code is available at https://github.com/Supereeeee/LKFMixer.
中文: LKFMixer模型采用大卷积核模拟自注意力机制以捕捉图像超分辨率中的非局部特征,在性能和重建质量上超越现有方法,且推理速度更快。
English: The LKFMixer model uses large convolutional kernels to mimic self-attention for capturing non-local features in image super-resolution, achieving superior performance and faster inference than existing methods.
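
One common way to keep a 31x31 depthwise convolution cheap is to factor it into a 1x31 and a 31x1 strip, which is how I read the abstract's "coordinate decomposition"; the real LKFMixer block likely adds more structure, so the sketch below is only an assumption-laden illustration.

```python
# Minimal PyTorch sketch: a large-kernel depthwise conv factored into strips.
import torch
import torch.nn as nn

class DecomposedLargeKernelConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 31):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise (groups=channels) keeps parameters and FLOPs low.
        self.conv_h = nn.Conv2d(channels, channels, (1, kernel_size),
                                padding=(0, pad), groups=channels)
        self.conv_v = nn.Conv2d(channels, channels, (kernel_size, 1),
                                padding=(pad, 0), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)   # mix channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.conv_v(self.conv_h(x)))

x = torch.randn(1, 48, 64, 64)
print(DecomposedLargeKernelConv(48)(x).shape)   # torch.Size([1, 48, 64, 64])
```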

Authors:Xinyi Wang, Smaranda Tasmoc, Nantheera Anantrasirichai, Angeliki Katsenou
Title: Guiding WaveMamba with Frequency Maps for Image Debanding
Abstract:
Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: https://github.com/xinyiW915/Debanding-PCS2025.
中文摘要:该研究提出的基于小波状态空间模型和频率掩码的带状伪影修复方法,在保留图像纹理的同时有效抑制了压缩伪影,在公开数据集上的表现优于现有技术。
English Summary: The proposed banding restoration method using a Wavelet State Space Model and frequency masking effectively reduces compression artifacts while preserving image textures, outperforming existing techniques on benchmark datasets.

Authors:Qiangong Zhou, Zhiting Wang, Mingyou Yao, Zongyang Liu
Title: Allen: Rethinking MAS Design through Step-Level Policy Autonomy
Abstract:
We introduce Allen, a new Multi-Agent System (MAS) designed to address two core challenges in current MAS design: (1) improving the system's policy autonomy, empowering agents to dynamically adapt their behavioral strategies, and (2) achieving a trade-off between collaborative efficiency, task supervision, and human oversight in complex network topologies. Our core insight is to redefine the basic execution unit in the MAS, allowing agents to autonomously form different patterns by combining these units. We construct a four-tier state architecture (Task, Stage, Agent, Step) to constrain system behavior from both task-oriented and execution-oriented perspectives, unifying topological optimization with controllable progress. Allen grants unprecedented policy autonomy while trading off some controllability of the collaborative structure. The project code is open-sourced at: https://github.com/motern88/Allen
中文: Allen是一种新型多智能体系统,通过四层架构设计提升策略自主性,使智能体动态调整行为策略,并在协作效率与可控性之间实现平衡。
English: Allen is a novel Multi-Agent System that enhances policy autonomy by enabling agents to dynamically adapt strategies and balances collaborative efficiency with human oversight through a four-tier architecture.
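
The four-tier state (Task, Stage, Agent, Step) could be represented as nested records; the field names below are illustrative assumptions, not the project's actual schema.

```python
# One possible shape for Allen's four-tier state hierarchy (all fields assumed).
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str
    observation: str = ""
    done: bool = False

@dataclass
class AgentState:
    name: str
    policy: str                                   # behavioural strategy currently in use
    steps: list = field(default_factory=list)     # list[Step]

@dataclass
class Stage:
    goal: str
    agents: list = field(default_factory=list)    # list[AgentState]

@dataclass
class Task:
    description: str
    stages: list = field(default_factory=list)    # list[Stage]

    def progress(self) -> float:
        steps = [s for st in self.stages for a in st.agents for s in a.steps]
        return sum(s.done for s in steps) / max(len(steps), 1)
```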

Authors:Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
Title: Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
Abstract:
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features, respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP
Chinese: DeCLIP通过解耦自注意力机制为内容和上下文特征,提升了局部区分度和空间一致性,在多种开放词汇密集感知任务中实现了最优性能。
English: DeCLIP enhances CLIP by decoupling self-attention into content and context features, improving local discriminability and spatial consistency for superior open-vocabulary dense perception across multiple tasks.

Authors:MengChao Wang, Qiang Wang, Fan Jiang, Mu Xu
Title: FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
Abstract:
Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these challenges, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Our project page: https://fantasy-amap.github.io/fantasy-talking2/

Authors:Abhinav Kumar, Yuliang Guo, Zhihao Zhang, Xinyu Huang, Liu Ren, Xiaoming Liu
Title: CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
Abstract:
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plücker embeddings, image transformations, or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose the Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than $45\%$, achieving SoTA performance on the CARLA dataset. Code and models are available at https://github.com/abhi1kumar/CHARM3R
中文: 本文针对单目三维物体检测器对相机高度变化敏感的问题,提出CHARM3R方法通过融合深度估计提升泛化能力,在CARLA数据集上实现了超过45%的性能提升并达到最优水平。
English: This paper addresses the challenge of monocular 3D object detectors' sensitivity to camera height variations by proposing CHARM3R, which averages depth estimates to enhance generalization, achieving over 45% improvement and state-of-the-art performance on the CARLA dataset.
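
The core fusion step is simple to illustrate: the abstract states that regressed depth and ground-based depth drift in opposite directions under a camera-height change, so averaging cancels much of the error. Equal weights are an assumption here; the toy numbers are purely illustrative.

```python
# Toy sketch of the depth fusion CHARM3R describes (not the released code).
import torch

def fuse_depth(regressed_depth: torch.Tensor, ground_depth: torch.Tensor) -> torch.Tensor:
    """Both inputs are per-object depth estimates in metres, same shape."""
    return 0.5 * (regressed_depth + ground_depth)

z_reg = torch.tensor([21.3, 34.0])   # biased low after raising the camera (illustrative)
z_grd = torch.tensor([23.1, 36.2])   # biased high under the same shift (illustrative)
print(fuse_depth(z_reg, z_grd))      # tensor([22.2000, 35.1000])
```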

Authors:Haomin Zhang, Kristin Qi, Shuxin Yang, Zihao Chen, Chaofan Ding, Xinhan Di
Title: LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters
Abstract:
Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: $FD_{\text{passt}}$ 450.00 $\rightarrow$ 327.29 (+27.27%), $FD_{\text{panns}}$ 34.88 $\rightarrow$ 22.68 (+34.98%), $FD_{\text{vgg}}$ 3.75 $\rightarrow$ 1.28 (+65.87%), $KL_{\text{panns}}$ 2.49 $\rightarrow$ 2.07 (+16.87%), $KL_{\text{passt}}$ 1.78 $\rightarrow$ 1.53 (+14.04%), $IS_{\text{panns}}$ 4.17 $\rightarrow$ 4.30 (+3.12%), $IB_{\text{score}}$ 0.25 $\rightarrow$ 0.28 (+12.00%), $Energy\Delta10\text{ms}$ 0.3013 $\rightarrow$ 0.1349 (+55.23%), $Energy\Delta10\text{ms(vs.GT)}$ 0.0531 $\rightarrow$ 0.0288 (+45.76%), and $Sem.\,Rel.$ 2.73 $\rightarrow$ 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio.
中文: 该研究提出LD-LAudio-V1模型,通过集成双轻量适配器和发布纯净标注数据集,显著提升长视频音频生成的性能,减少拼接伪影和时间不一致性。
English: The study introduces LD-LAudio-V1, a model that enhances long-form video-to-audio generation by incorporating dual lightweight adapters and a clean, annotated dataset, significantly reducing artifacts and improving performance metrics.

Authors:Wentao Mo, Qingchao Chen, Yuxin Peng, Siyuan Huang, Yang Liu
Title: Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset
Abstract:
The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.

Authors:Kelin Yu, Sheng Zhang, Harshit Soora, Furong Huang, Heng Huang, Pratap Tokekar, Ruohan Gao
Title: GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Abstract:
Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GenFlowRL, which derives shaped rewards from generated flow trained from diverse cross-embodiment datasets. This enables learning generalizable and robust policies from diverse demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GenFlowRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios. Our Project Page: https://colinyu1.github.io/genflowrl
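
The reward-shaping idea can be sketched as scoring a policy step by how closely the observed object-centric flow tracks the flow proposed by the generative model; the distance metric and scaling below are assumptions, not the paper's exact reward.

```python
# Hedged sketch of a shaped reward from generated object-centric flow.
import numpy as np

def flow_reward(observed_flow: np.ndarray, generated_flow: np.ndarray,
                scale: float = 1.0) -> float:
    """Flows are (K, 2) arrays of 2D displacements for K tracked object points."""
    error = np.linalg.norm(observed_flow - generated_flow, axis=-1).mean()
    return float(-scale * error)      # higher reward when the flows agree
```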

Authors:Wenbin An, Jiahao Nie, Yaqiang Wu, Feng Tian, Shijian Lu, Qinghua Zheng
Title: Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
Abstract:
By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs.
中文: 本综述探讨了如何通过外部工具增强多模态大语言模型,以克服数据质量、任务性能和评估方面的局限,强调了其在提升模型能力和应用前景方面的变革潜力。
English: This survey explores how augmenting Multimodal Large Language Models with external tools can overcome limitations in data quality, task performance, and evaluation, highlighting their transformative potential for advancing capabilities and applications.

Authors:Yoli Shavit, Yosi Keller
Title: Relative Pose Regression with Pose Auto-Encoders: Enhancing Accuracy and Data Efficiency for Retail Applications
Abstract:
Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: https://github.com/yolish/camera-pose-auto-encoders
中文摘要:本研究通过将相机姿态自动编码器扩展至相对姿态回归任务,提出无需额外存储图像或姿态数据的优化方案,能够显著提升零售环境中单图像绝对姿态回归的定位精度,并在仅使用30%训练数据时仍保持竞争优势。
English Summary: This study enhances camera localization accuracy in retail settings by extending Camera Pose Auto-Encoders to Relative Pose Regression and introducing a novel refinement method that improves Absolute Pose Regression predictions without extra data storage, achieving competitive performance with only 30% of training data.

Authors:Wenqi Guo, Shan Du
Title: VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip
Abstract:
We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only a small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and a ComfyUI node are available at https://github.com/weathon/VSF/tree/main.
中文: 值符号翻转(VSF)是一种高效方法,通过动态翻转注意力值来增强扩散模型中的负向提示引导,在抑制不良内容方面优于现有技术,同时保持图像质量。
English: Value Sign Flip (VSF) is a computationally efficient method that enhances negative prompt guidance in few-step diffusion models by dynamically flipping attention values, outperforming existing techniques in suppressing undesired content while maintaining image quality.
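
My reading of the mechanism described above is: attend jointly over positive- and negative-prompt tokens, but negate the value vectors coming from the negative prompt so its content is pushed away rather than added. The sketch below reflects that reading, with assumed shapes and scaling; it is not the released implementation.

```python
# Minimal sketch of a Value Sign Flip style cross-attention.
import torch

def vsf_cross_attention(q, k_pos, v_pos, k_neg, v_neg):
    """q: (B, Nq, D); k_*/v_*: (B, Nk, D) for positive / negative prompt tokens."""
    k = torch.cat([k_pos, k_neg], dim=1)
    v = torch.cat([v_pos, -v_neg], dim=1)          # the sign flip on negative values
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```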

Authors:Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, Chang Wen Chen
Title: A Survey on Video Temporal Grounding with Multimodal Large Language Model
Abstract:
The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.
中文: 本综述系统梳理了基于多模态大语言模型的视频时序定位研究,通过三维分类法分析模型功能、训练范式与视频特征处理,并指出当前局限性与未来研究方向。
English: This survey systematically reviews multimodal large language model-based video temporal grounding (VTG-MLLMs), analyzing their functional roles, training paradigms, and video processing techniques while identifying research gaps and future directions.

Authors:Chaoyue Song, Xiu Li, Fan Yang, Zhongcong Xu, Jiacheng Wei, Fayao Liu, Jiashi Feng, Guosheng Lin, Jianfeng Zhang
Title: Puppeteer: Rig and Animate Your 3D Models
Abstract:
Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
中文: Puppeteer框架通过自动化绑定和动画处理,显著提升了3D模型的骨骼预测与蒙皮质量,能够稳定生成无抖动的高保真动画,优于现有技术。
English: Puppeteer is an innovative framework that automates rigging and animation for diverse 3D objects, outperforming existing methods in skeletal prediction and skinning quality while eliminating jittering issues.

Authors:Mengyuan Liu, Xinshun Wang, Zhongbin Fang, Deheng Ye, Xia Li, Tao Tang, Songtao Wu, Xiangtai Li, Ming-Hsuan Yang
Title: Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning
Abstract:
This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source codes and models are available at https://github.com/BradleyWang0416/Human-in-Context.
中文: 本文提出Human-in-Context (HiC)模型,在PiC框架基础上扩展了多模态处理能力,通过统一表示和优化策略实现了跨任务、跨数据集的3D人体运动建模,显著提升了泛化性能与可扩展性。
English: This paper introduces Human-in-Context (HiC), a unified model that extends the pose-based PiC framework to generalize across multiple modalities, tasks, and datasets for 3D human motion, achieving superior performance and scalability through enhanced prompt strategies and architecture.

Authors:Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier
Title: MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Abstract:
Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
中文: 自监督学习通过MAESTRO模型针对遥感数据特性进行了优化,该模型结合了融合策略和光谱先验归一化,在多时序任务上取得了领先性能,并在多个地球观测数据集中保持竞争力。
English: Self-supervised learning is adapted for remote sensing with MAESTRO, a novel masked autoencoder that optimizes fusion and normalization using spectral priors, achieving state-of-the-art performance on multitemporal tasks and remaining competitive across four Earth observation datasets.

Authors:Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan
Title: STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
Abstract:
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
中文: STream3R提出了一种仅解码器的Transformer框架,通过因果注意力机制实现高效三维重建,在静态和动态场景中均超越现有方法,并支持可扩展的训练。
English: STream3R introduces a decoder-only Transformer framework for efficient 3D reconstruction using causal attention, outperforming existing methods in both static and dynamic scenes while enabling scalable training.
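
The causal structure such a streaming model needs is easy to show: a block-causal mask so that every token of frame t attends only to tokens of frames 0..t. The token count per frame below is an assumption for illustration.

```python
# Block-causal attention mask over frame tokens (True = attention allowed).
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # mask[i, j] is True when the frame of token j is not later than that of token i.
    return frame_idx.unsqueeze(0) <= frame_idx.unsqueeze(1)

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
```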

Authors:Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan
Title: ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
Abstract:
Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.
中文摘要:ToonComposer是一种生成模型,将中间帧绘制和上色统一为单一阶段,通过稀疏草图注入和卡通适配技术提升动画制作的控制力与效率,在质量和灵活性上均优于现有方法。
English Summary: ToonComposer is a generative model that integrates inbetweening and colorization into a single stage, using sparse sketch injection and cartoon adaptation to enhance control and efficiency in cartoon production, outperforming existing methods in quality and flexibility.

Authors:Sushant Gautam, Vajira Thambawita, Michael Riegler, Pål Halvorsen, Steven Hicks
Title: Medico 2025: Visual Question Answering for Gastrointestinal Imaging
Abstract:
The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025
中文摘要:Medico 2025挑战赛通过基于Kvasir-VQA-x1数据集的视觉问答任务推进胃肠影像可解释人工智能发展,结合量化指标与专家评估以构建可信赖的医疗AI系统。
English Summary: The Medico 2025 challenge advances explainable AI for gastrointestinal imaging through Visual Question Answering tasks using the Kvasir-VQA-x1 dataset, combining performance metrics and expert evaluations to build trustworthy medical AI systems.

Authors:Yibo Zhang, Li Zhang, Rui Ma, Nan Cao
Title: TexVerse: A Universe of 3D Objects with High-Resolution Textures
Abstract:
We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.
中文:TexVerse是一个大规模3D数据集,通过提供超过85.8万个独特的高分辨率3D模型填补了高分辨率纹理生成的空白,包含专门针对绑定和动画模型的子集,并配有详细注释,适用于多种图形应用。
English: TexVerse is a large-scale 3D dataset addressing the gap in high-resolution texture generation by providing over 858K unique high-resolution 3D models, including specialized subsets for rigged and animated models, with detailed annotations for various graphics applications.

Authors:Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Title: Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation
Abstract:
Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
中文: PhysHPO提出了一种分层跨模态直接偏好优化框架,通过四个粒度级别的偏好对齐和从现有数据集中自动选择优质数据,显著提升了视频生成的物理合理性和整体质量。
English: PhysHPO introduces a hierarchical cross-modal direct preference optimization framework to enhance physical plausibility in video generation by aligning preferences across four granular levels and leveraging an automated data selection pipeline from existing datasets.
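
One way to picture the hierarchical objective is a standard DPO term per granularity (instance, state, motion, semantic) combined with weights; the per-level log-probabilities and the weighting are assumptions, and the paper's exact formulation may differ.

```python
# Hedged sketch of a hierarchical DPO-style loss across granularities.
import torch
import torch.nn.functional as F

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO: reward margin of the preferred sample over the rejected one,
    # measured against a frozen reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def hierarchical_dpo_loss(levels: dict, weights: dict) -> torch.Tensor:
    """levels maps a level name to a (logp_w, logp_l, ref_logp_w, ref_logp_l) tuple."""
    return sum(weights[name] * dpo_term(*vals) for name, vals in levels.items())
```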

Authors:Tajamul Ashraf, Iqra Altaf Gillani
Title: Generalizable Federated Learning using Client Adaptive Focal Modulation
Abstract:
Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed
中文: 这项扩展研究提出的AdaptFED框架,通过集成任务感知的客户端嵌入来优化个性化焦点调制机制,提供了更强的理论保证和跨多模态数据的广泛实验验证,同时推出能降低通信开销的高效变体,实现了可扩展的联邦学习系统。
English: This extended work introduces AdaptFED, which enhances the TransFed framework by refining personalized focal modulation with task-aware client embeddings, providing stronger theoretical guarantees and broader empirical validation across diverse data types, while also offering an efficient variant that reduces communication overhead for scalable federated learning.
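
"Low-rank hypernetwork conditioning" could be realized as follows: a hypernetwork maps a (task-aware) client embedding to two low-rank factors whose product parameterizes that client's modulation layer, so only small tensors need to travel between server and client. All shapes and names below are illustrative assumptions, not the TransFed/AdaptFED code.

```python
# Sketch of a low-rank hypernetwork producing per-client weights from an embedding.
import torch
import torch.nn as nn

class LowRankHyperNet(nn.Module):
    def __init__(self, embed_dim: int, out_features: int, in_features: int, rank: int = 8):
        super().__init__()
        self.to_a = nn.Linear(embed_dim, out_features * rank)
        self.to_b = nn.Linear(embed_dim, rank * in_features)
        self.rank, self.out_f, self.in_f = rank, out_features, in_features

    def forward(self, client_embedding: torch.Tensor) -> torch.Tensor:
        a = self.to_a(client_embedding).view(self.out_f, self.rank)
        b = self.to_b(client_embedding).view(self.rank, self.in_f)
        return a @ b                           # per-client weight matrix

hyper = LowRankHyperNet(embed_dim=64, out_features=256, in_features=256)
w_client = hyper(torch.randn(64))
print(w_client.shape)                          # torch.Size([256, 256])
```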

Authors:Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
Title: UI-Venus Technical Report: Building High-performance UI Agents with RFT
Abstract:
We present UI-Venus, a native UI agent, built on a multimodal large language model, that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope encourages further research and development in the community. Code is available at https://github.com/inclusionAI/UI-Venus.
中文:UI-Venus是一种先进的多模态UI代理,通过强化微调和创新的自我进化技术,仅用少量高质量训练样本就在UI定位和导航任务中超越了现有模型,实现了最优性能。
English: UI-Venus is a state-of-the-art multimodal UI agent that achieves superior performance in UI grounding and navigation tasks using reinforcement fine-tuning and innovative self-evolving techniques, outperforming existing models with minimal training data.
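
A grounding reward for RFT of this kind is often as simple as checking whether the predicted click lands inside the ground-truth element box; the rule below is a common formulation and not claimed to be UI-Venus's exact reward.

```python
# Illustrative grounding reward for UI-agent reinforcement fine-tuning.
def grounding_reward(pred_xy, gt_box) -> float:
    """pred_xy = (x, y); gt_box = (x1, y1, x2, y2), all in pixels."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

print(grounding_reward((120, 45), (100, 30, 180, 60)))   # 1.0
```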

Authors:Zhenning Shi, Zizheng Yan, Yuhang Yu, Clara Xue, Jingyu Zhuang, Qi Zhang, Jinwei Chen, Tao Li, Qingnan Fan
Title: Ultra-High-Definition Reference-Based Landmark Image Super-Resolution with Generative Diffusion Prior
Abstract:
Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at https://github.com/nkicsl/TriFlowSR.
中文:提出的TriFlowSR框架通过显式模式匹配有效对齐低分辨率与参考高分辨率图像,结合专为超高清场景设计的Landmark-4K数据集,在利用参考图像信息方面优于现有方法。
English: The proposed TriFlowSR framework effectively aligns low-resolution and reference high-resolution images through explicit pattern matching, supported by the new Landmark-4K dataset for ultra-high-definition restoration, outperforming previous methods in utilizing reference information.

Authors:Lixin Jia, Zhiqing Guo, Gaobo Yang, Liejun Wang, Keqin Li
Title: Forgery Guided Learning Strategy with Dual Perception Network for Deepfake Cross-domain Detection
Abstract:
The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology with strong generalization to cope with fast iterative forgery techniques. To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available on https://github.com/vpsg-research/FGL.
中文: 本研究提出的伪造引导学习策略和双感知网络通过动态捕捉伪造痕迹差异与关联,能有效适应未知伪造技术,显著提升了深度伪造检测的泛化能力。
English: The proposed Forgery Guided Learning strategy and Dual Perception Network effectively enhance deepfake detection by dynamically adapting to unknown forgery techniques through differential feature extraction and comprehensive trace correlation analysis.

Authors:Matej Vitek, Darian Tomašević, Abhijit Das, Sabari Nathan, Gökhan Özbulak, Gözde Ayşe Tataroğlu Özbulak, Jean-Paul Calbimonte, André Anjos, Hariohm Hemant Bhatt, Dhruv Dhirendra Premani, Jay Chaudhari, Caiyong Wang, Jian Jiang, Chi Zhang, Qi Zhang, Iyyakutti Iyappan Ganapathi, Syed Sadaf Ali, Divya Velayudan, Maregu Assefa, Naoufel Werghi, Zachary A. Daniels, Leeon John, Ritesh Vyas, Jalil Nourmohammadi Khiarak, Taher Akbari Saeed, Mahsa Nasehi, Ali Kianfar, Mobina Pashazadeh Panahi, Geetanjali Sharma, Pushp Raj Panth, Raghavendra Ramachandra, Aditya Nigam, Umapada Pal, Peter Peer, Vitomir Štruc
Title: Privacy-enhancing Sclera Segmentation Benchmarking Competition: SSBC 2025
Abstract:
This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: $(i)$ one relying solely on synthetic data for model development, and $(ii)$ one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top performing models that achieved $F_1$ scores of over $0.8$ in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices rather than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition is available at: https://github.com/dariant/SSBC_2025.
中文: 2025年巩膜分割基准竞赛证明,完全基于合成眼部数据训练的模型能够取得有竞争力的性能(最优模型F1分数超过0.8),而混合数据赛道的结果表明方法选择比添加真实数据更重要,这凸显了合成数据在隐私保护生物识别开发中的潜力。
English: The 2025 Sclera Segmentation Benchmarking Competition demonstrated that models trained entirely on synthetic ocular data can achieve competitive performance, with top entries surpassing 0.8 F1 scores, while mixed-data track results showed methodological choices often outweighed the benefits of adding real data, highlighting synthetic data's potential for privacy-preserving biometric development.

Authors:Yanjun Li, Yuqian Fu, Tianwen Qian, Qi'ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, Xiaoling Wang
Title: EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. Data and codes will be released at: https://github.com/MyUniverse0726/EgoCross
Chinese: EgoCross基准测试旨在评估多模态大语言模型在第一人称视频问答中的跨领域泛化能力,揭示了其在日常活动之外领域的局限性并探索了改进方法。
English: The EgoCross benchmark is introduced to assess multimodal large language models' cross-domain generalization in egocentric video question answering, revealing their limitations beyond daily activities and exploring improvement strategies.

Authors:NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu
Title: NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
Abstract:
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
Chinese: NextStep-1 是一个140亿参数的自回归模型,通过结合流匹配头和离散文本与连续图像标记的训练,在文本到图像生成和图像编辑任务中实现了最先进的性能。
English: NextStep-1 is a 14B autoregressive model that achieves state-of-the-art performance in text-to-image generation and image editing by training on discrete text and continuous image tokens with a flow matching head.

Authors:Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee
Title: CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation
Abstract:
Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into $k$ clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5 percentage points in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster.
中文: 提出的CountCluster方法通过在去噪早期步骤中根据输入数量对交叉注意力图进行聚类,使生成图像中的物体数量与文本描述精确匹配,无需外部工具或额外训练即可将物体计数准确率平均提升18.5%。
English: The proposed CountCluster method improves object count accuracy in diffusion-based text-to-image generation by clustering cross-attention maps during early denoising steps to align with specified object quantities, achieving an 18.5% accuracy boost without external tools or retraining.
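
The clustering step can be sketched as follows: keep the most activated locations of the object token's cross-attention map at an early denoising step, split them into k clusters (k = requested object count), and score how spatially separated the cluster centers are. The paper additionally optimizes the latent toward an ideal target distribution; that step, and the keep ratio below, are omitted or assumed here.

```python
# Hedged sketch of clustering an object cross-attention map into k groups.
import numpy as np
from sklearn.cluster import KMeans

def cluster_separation(attn_map: np.ndarray, k: int, keep_ratio: float = 0.1) -> float:
    h, w = attn_map.shape
    flat = attn_map.ravel()
    top = np.argsort(flat)[-max(int(keep_ratio * flat.size), k):]   # most activated pixels
    coords = np.stack([top // w, top % w], axis=1).astype(float)
    centers = KMeans(n_clusters=k, n_init=10).fit(coords).cluster_centers_
    # Mean pairwise distance between cluster centers: larger = better separated.
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    return float(dists.sum() / (k * (k - 1))) if k > 1 else 0.0
```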

Authors:Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li, Xiangmo Zhao
Title: Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios
Abstract:
The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively. The code is available at https://github.com/Charm11492/MCFNet.
English Summary: This study tackles the limitations of RGB cameras in complex traffic settings by integrating an event camera to enhance dynamic range and proposing MCFNet, a motion cue fusion network that achieves superior object detection through spatiotemporal alignment and adaptive cross-modal feature fusion.

Authors:Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng
Title: HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection
Abstract:
In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) captures local motion patterns between adjacent frames and then enhances the local temporal context; additionally, a temporal alignment module (TAM) addresses potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source codes are available at https://github.com/Lurenjia-LRJ/HyperTea.
English: The proposed HyperTea model innovatively combines CNNs, RNNs, and hypergraph neural networks to address moving infrared small target detection challenges by effectively modeling high-order spatiotemporal correlations across multiple temporal scales, achieving state-of-the-art performance on benchmark datasets.

Authors:Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang
Title: Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation
Abstract:
In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
English: This paper details a winning approach for the DataCV ICCV Challenge that constructs a non-overlapping, high-quality face dataset by cleaning HSFace with a Mixture-of-Experts strategy, augmenting real identities, and generating synthetic ones via Stable Diffusion and Vec2Face, while employing curriculum learning to enhance model training.

Authors:Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Chunyang Cheng, Tao Zhou, Xiaojun Wu, Josef Kittler
Title: Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking
Abstract:
Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separate benchmarks, causing inconsistency between training and testing and thus performance degradation. To address these issues, this work advances in two aspects: (1) a unified benchmark, coined UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%; (2) the unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, dedicated analyses reveal that the performance degradation is negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.
English Summary: This paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking that addresses performance degradation by reformulating the unification process in a serial format and incorporating continual learning principles, while also revealing correlations between network capacity and modality discrepancies.
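Since the serial reformulation turns unification into a continual-learning problem, any standard anti-forgetting tool applies. The sketch below is a generic distillation recipe under our own assumptions, not the paper's method: the current stage is trained on the new task while its logits are kept close to a frozen copy from the previous stage.

```python
import torch
import torch.nn.functional as F

def serial_unification_loss(new_logits, old_logits, targets, lam=1.0):
    """Continual-learning sketch for serial task unification: fit the
    new task while distilling the frozen previous-stage model so that
    earlier tracking tasks are not forgotten. `lam` is a hypothetical
    trade-off weight."""
    task = F.cross_entropy(new_logits, targets)
    keep = F.kl_div(F.log_softmax(new_logits, dim=-1),
                    F.softmax(old_logits.detach(), dim=-1),
                    reduction="batchmean")
    return task + lam * keep
```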

Authors:Ryan Ramos, Vladan Stojnić, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias, Noa Garcia
Title: Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
Abstract:
Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
English Summary: This study reveals that subtle, often imperceptible image acquisition parameters are systematically encoded in visual representations and can significantly influence semantic predictions depending on their correlation with semantic labels.
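The core measurement is easy to reproduce in spirit: embed images that differ only in an acquisition or processing parameter, then check whether a linear probe can recover it. The sketch below uses JPEG quality as the parameter and assumes a frozen image encoder passed in as a callable; `encode` is our placeholder, not the paper's API.

```python
import io
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression

def jpeg_quality_probe(encode, images, qualities=(30, 60, 95)):
    """Can a frozen encoder's features reveal JPEG quality?

    encode: callable mapping a PIL image to a 1-D feature vector
            (e.g., a frozen CLIP image encoder); assumed given.
    images: list of PIL images.
    """
    feats, labels = [], []
    for img in images:
        for q in qualities:
            buf = io.BytesIO()
            img.convert("RGB").save(buf, format="JPEG", quality=q)
            buf.seek(0)
            feats.append(encode(Image.open(buf)))
            labels.append(q)
    X, y = np.stack(feats), np.array(labels)
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe.score(X, y)  # use a held-out split in practice
```

High probe accuracy on held-out images would indicate that the processing trace is linearly encoded in the representation.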

Authors:Farid Tasharofi, Fuxin Fan, Melika Qahqaie, Mareike Thies, Andreas Maier
Title: FIND-Net -- Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction
Abstract:
Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net's ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net's potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at https://github.com/Farid-Tasharofi/FIND-Net
English: FIND-Net is a novel metal artifact reduction framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation in CT imaging, demonstrating significant improvements over existing methods.
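A minimal PyTorch sketch of the two ingredients named in the abstract: a spatial convolution branch combined with a frequency branch that mixes rFFT coefficients and applies a trainable Gaussian low-pass filter. The layer sizes and wiring are hypothetical; FIND-Net's actual FFC blocks are more elaborate.

```python
import torch
import torch.nn as nn

class SpectralBlock(nn.Module):
    """Minimal FFC-style unit: spatial conv branch plus a frequency
    branch with a 1x1 conv over real/imag parts and a trainable
    Gaussian low-pass in the rFFT domain. A sketch, not FIND-Net."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Conv2d(ch, ch, 3, padding=1)
        self.freq = nn.Conv2d(2 * ch, 2 * ch, 1)       # mixes real/imag
        self.log_sigma = nn.Parameter(torch.zeros(1))  # trainable bandwidth

    def forward(self, x):
        B, C, H, W = x.shape
        X = torch.fft.rfft2(x, norm="ortho")           # (B, C, H, W//2+1)
        z = self.freq(torch.cat([X.real, X.imag], dim=1))
        re, im = z.chunk(2, dim=1)
        # Trainable Gaussian filter over radial frequency.
        fy = torch.fft.fftfreq(H, device=x.device).view(1, 1, H, 1)
        fx = torch.fft.rfftfreq(W, device=x.device).view(1, 1, 1, W // 2 + 1)
        g = torch.exp(-(fy ** 2 + fx ** 2) / (2 * self.log_sigma.exp() ** 2))
        X = torch.complex(re, im) * g
        x_freq = torch.fft.irfft2(X, s=(H, W), norm="ortho")
        return torch.relu(self.spatial(x) + x_freq)
```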

Authors:Xinyi Wang, Angeliki Katsenou, David Bull
Title: DIVA-VQA: Detecting Inter-frame Variations in UGC Video Quality
Abstract:
The rapid growth of user-generated video content (UGC) has driven increased demand for research on no-reference (NR) perceptual video quality assessment (VQA). NR-VQA is a key component for large-scale video quality monitoring in social media and streaming applications where a pristine reference is not available. This paper proposes a novel NR-VQA model based on spatio-temporal fragmentation driven by inter-frame variations. By leveraging these inter-frame differences, the model progressively analyses quality-sensitive regions at multiple levels: frames, patches, and fragmented frames. It integrates frames, fragmented residuals, and fragmented frames aligned with residuals to effectively capture global and local information. The model extracts both 2D and 3D features in order to characterize these spatio-temporal variations. In experiments on five UGC datasets against state-of-the-art models, our proposed method ranked among the top two in average rank correlation (DIVA-VQA-L: 0.898 and DIVA-VQA-B: 0.886). This improved performance comes at low runtime complexity, with DIVA-VQA-B ranked first and DIVA-VQA-L third on average compared to the fastest existing NR-VQA method. Code and models are publicly available at: https://github.com/xinyiW915/DIVA-VQA.
English: This paper introduces a novel no-reference video quality assessment model that utilizes spatio-temporal fragmentation based on inter-frame variations to effectively capture global and local quality features, achieving top-tier performance on multiple UGC datasets with low computational complexity.
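As a concrete illustration of inter-frame-variation-driven fragmentation, the sketch below scores patches by residual energy between consecutive frames and tiles the most quality-sensitive ones into a compact fragmented frame. The patch size and number of kept patches are hypothetical choices, not the paper's settings.

```python
import torch

def fragment_by_residual(prev, cur, patch=32, keep=16):
    """Pick the `keep` patches of `cur` with the largest inter-frame
    residual energy and tile them into a small 'fragmented' frame.
    prev, cur: (C, H, W) tensors; H and W divisible by `patch`;
    `keep` must be a square number. A sketch, not the DIVA-VQA code."""
    C, H, W = cur.shape
    res = (cur - prev).abs()
    # Patch-wise residual energy over a (gh, gw) grid of patches.
    energy = res.unfold(1, patch, patch).unfold(2, patch, patch).sum(dim=(0, 3, 4))
    gh, gw = energy.shape
    idx = energy.flatten().topk(keep).indices
    rows = torch.div(idx, gw, rounding_mode="floor")
    cols = idx % gw
    tiles = [cur[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
             for r, c in zip(rows.tolist(), cols.tolist())]
    side = int(keep ** 0.5)
    assert side * side == keep, "keep must be a square number"
    tile_rows = [torch.cat(tiles[i * side:(i + 1) * side], dim=2)
                 for i in range(side)]
    return torch.cat(tile_rows, dim=1)   # (C, side*patch, side*patch)
```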

Authors:Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang
Title: HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
Abstract:
While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/

Authors:Humza Naveed, Xina Zeng, Mitch Bryson, Nagita Mehrseresht
Title: Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection
Abstract:
Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is the Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD), along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF), to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle the high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets: Levir-CD, WHU-CD, CLCD, and S2Looking. We achieve a 2.5% F1-score improvement on the large, complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-CEM-CD
English Summary: This study enhances the Segment Anything Model (SAM) for remote sensing change detection by integrating spatial-temporal feature enhancement and multi-scale decoder fusion, achieving state-of-the-art performance with a novel cross-entropy masking loss to address class imbalance.
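The abstract does not spell out the CEM formulation, so the sketch below shows one plausible reading: a binary cross-entropy in which the overwhelming easy no-change pixels are masked out, leaving all change pixels plus only the hardest fraction of negatives. Treat it as an online-hard-example-mining-style stand-in, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cem_loss(logits, target, neg_keep=0.1):
    """Cross-entropy with masking of easy negatives, sketching the idea
    behind a CEM-style loss for change detection.
    logits, target: (B, 1, H, W); target is float in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pos = target > 0.5
    neg_losses = bce[~pos]
    if neg_losses.numel() == 0:
        return bce.mean()
    k = max(1, int(neg_keep * neg_losses.numel()))
    hard_neg, _ = neg_losses.topk(k)          # hardest negatives only
    return torch.cat([bce[pos], hard_neg]).mean()
```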

Authors:Philipp Wolters, Johannes Gilg, Torben Teepe, Gerhard Rigoll
Title: SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation, critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment and Doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines, and motion modelling. Our method achieves strong improvements over state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including the real-world open-loop nuScenes, the long-horizon T-nuScenes, and the closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at https://phi-wol.github.io/sparcad/

Authors:Boyi Zheng, Qing Liu
Title: PSScreen: Partially Supervised Multiple Retinal Disease Screening
Abstract:
Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites and the absence of labels for partial classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. PSScreen consists of two streams: one learns deterministic features, and the other learns probabilistic features via uncertainty injection. We then leverage textual guidance to decouple the two types of features into disease-wise features and align them via feature distillation to boost domain generalization. Meanwhile, we employ pseudo-label consistency between the two streams to address the label-absence issue and introduce self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream, further enhancing detection performance. Experiments show that PSScreen significantly enhances average detection performance on six retinal diseases and the normal state, and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at https://github.com/boyiZheng99/PSScreen.
English: PSScreen is a novel partially supervised model that tackles domain shifts and missing labels in multi-disease retinal screening by combining deterministic and probabilistic feature streams with textual guidance and pseudo-label consistency, achieving state-of-the-art performance across various datasets.
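A minimal sketch of how the label-absence issue can be handled in the paper's two-stream layout: known classes receive supervised BCE, and the deterministic stream's predictions serve as pseudo-targets for the probabilistic stream on the unannotated classes. The masking interface and the 0.5 weight are our assumptions.

```python
import torch
import torch.nn.functional as F

def partial_label_loss(det_logits, prob_logits, labels, known_mask):
    """Two-stream loss for partially labeled multi-label screening.
    det_logits, prob_logits: (B, C) outputs of the deterministic and
    probabilistic streams; labels: (B, C) float in {0, 1};
    known_mask: (B, C) bool, True where the dataset annotates the class.
    A sketch of the idea, not PSScreen's exact objective."""
    sup = F.binary_cross_entropy_with_logits(
        det_logits[known_mask], labels[known_mask])
    # Deterministic stream provides pseudo-targets for unknown classes.
    with torch.no_grad():
        pseudo = torch.sigmoid(det_logits)
    cons = F.binary_cross_entropy_with_logits(
        prob_logits[~known_mask], pseudo[~known_mask])
    return sup + 0.5 * cons
```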

Authors:Yangjie Xiao, Ke Zhang, Jiacun Wang, Xin Sheng, Yurong Guo, Meijuan Chen, Zehua Ren, Zhaoye Zheng, Zhenbing Zhao
Title: A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection
Abstract:
Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to restore the edited defect bolts into their original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform those of state-of-the-art image editing models and effectively improve bolt defect detection performance, fully verifying the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.
English: The proposed SBDE method enhances bolt defect detection by generating high-quality defective images through segmentation-driven editing and dataset augmentation, significantly improving detection performance over existing models.

Authors:Hanna Herasimchyk, Robin Labryga, Tomislav Prusina
Title: Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers
Abstract:
We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
English: We propose a multi-head vision transformer method for multi-label plant species identification in vegetation images, addressing domain shift through multi-scale tiling and ensemble strategies, achieving third place in the PlantCLEF 2025 challenge.
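"Dynamic threshold optimization based on mean prediction length" can be read as tuning a global probability cutoff so that the average number of predicted species matches an expected count per quadrat. A small sketch of that search; the target count is a hypothetical tuning value, not the team's setting.

```python
import numpy as np

def tune_threshold(probs, target_mean_labels=3.0):
    """Pick a global probability threshold so the average number of
    species predicted per quadrat matches an expected count.
    probs: (N, C) multi-label probabilities."""
    best_t, best_gap = 0.5, float("inf")
    for t in np.linspace(0.01, 0.99, 99):
        mean_len = (probs >= t).sum(axis=1).mean()
        gap = abs(mean_len - target_mean_labels)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```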

Authors:Baichen Liu, Qi Lyu, Xudong Wang, Jiahua Dong, Lianqing Liu, Zhi Han
Title: CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation
Abstract:
Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an early attempt to address instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct an instance correlation loss, which emphasizes correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated from category text and uses an adjustive query-prompt matching mechanism to build a mapping between the queries of the current task and the semantic residual prompts. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure inter-task correlation within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on the YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.
English Summary: The study introduces CRISP, a method that tackles instance-wise, category-wise, and task-wise confusion in continual video instance segmentation by employing contrastive residual injection and semantic prompting, achieving superior performance on benchmark datasets while preventing catastrophic forgetting.

Authors:Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi
Title: STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Abstract:
Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing general-purpose VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction. STRIDE-QA therefore establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

Authors:Chaesong Park, Eunbin Seo, Jihyeon Hwang, Jongwoo Lim
Title: SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection
Abstract:
In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy. Although common in surface and depth estimation, these metrics have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
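The Slope-Aware Adaptive Feature module can be pictured as predicting per-pixel mixture weights over slope-specific height features. A minimal sketch under assumed tensor shapes, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SlopeAwareFusion(nn.Module):
    """Fuse K slope-specific height features with weights predicted
    from image cues; all sizes are hypothetical."""
    def __init__(self, ch, num_slopes=3):
        super().__init__()
        self.weight_head = nn.Conv2d(ch, num_slopes, 1)

    def forward(self, img_feat, slope_feats):
        # img_feat: (B, ch, H, W); slope_feats: (B, K, ch, H, W)
        w = self.weight_head(img_feat).softmax(dim=1)      # (B, K, H, W)
        return (w.unsqueeze(2) * slope_feats).sum(dim=1)   # (B, ch, H, W)
```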

Authors:Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, Haoang Li
Title: ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver
Abstract:
Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model's generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization. Our project page is https://zionchow.github.io/ReconVLA/.
English Summary: The proposed ReconVLA model introduces an implicit grounding paradigm using a diffusion transformer to reconstruct gaze regions, enabling precise visual attention allocation and manipulation in Vision-Language-Action models.

Authors:Zhaoming Kong, Jiahuan Zhang, Xiaowei Yang
Title: Efficient Image Denoising Using Global and Local Circulant Representation
Abstract:
The advancement of imaging devices and the countless images generated every day impose an increasingly high demand for efficient and effective image denoising. In this paper, we present a computationally simple denoising algorithm, termed Haar-tSVD, aiming to explore the nonlocal self-similarity prior and leverage the connection between principal component analysis (PCA) and the Haar transform under circulant representation. We show that global and local patch correlations can be effectively captured through a unified tensor-singular value decomposition (t-SVD) projection with the Haar transform. This results in a one-step, highly parallelizable filtering method that eliminates the need for learning local bases to represent image patches, striking a balance between denoising speed and performance. Furthermore, we introduce an adaptive noise estimation scheme based on a CNN estimator and eigenvalue analysis to enhance the robustness and adaptability of the proposed method. Experiments on different real-world denoising tasks validate the efficiency and effectiveness of Haar-tSVD for noise removal and detail preservation. Datasets, code and results are publicly available at https://github.com/ZhaomingKong/Haar-tSVD.
English: This paper introduces Haar-tSVD, a computationally efficient image denoising algorithm that utilizes the Haar transform and tensor-SVD to capture global and local correlations without learning local bases, achieving a balance between speed and performance through a parallelizable one-step filtering method and adaptive noise estimation.
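The fixed-basis idea can be illustrated on a single group of similar patches: project the stack onto an orthonormal Haar basis along the grouping axis, hard-threshold small coefficients, and project back, giving one-step filtering without learned local bases. A NumPy sketch with a hypothetical threshold rule:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar matrix of size n (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.vstack([np.kron(h, [1, 1]),
                       np.kron(np.eye(h.shape[0]), [1, -1])])
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def denoise_patch_group(group, sigma):
    """One-step filtering of a stack of similar patches.
    group: (n, p, p) array of n similar patches (n a power of two);
    sigma: noise standard deviation. A sketch of the Haar/t-SVD
    connection Haar-tSVD exploits, not the paper's code."""
    n = group.shape[0]
    H = haar_matrix(n)
    coef = np.tensordot(H, group, axes=(1, 0))    # (n, p, p) coefficients
    coef[np.abs(coef) < 3 * sigma] = 0.0          # hard threshold
    return np.tensordot(H.T, coef, axes=(1, 0))
```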

Authors:Tao Huang, Hongbo Pan, Nanxi Zhou, Shun Zhou
Title: A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method
Abstract:
High-accuracy matching of multimodal optical images is the basis of geometric processing. However, matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we propose a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. The method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with weighted least absolute deviation (WLAD). In the coarse matching step, phase consistency (PC) maps are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we apply radiometric and geometric transformation models between the two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible-to-infrared Landsat images, visible-to-near-infrared close-range images, and visible-to-infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE), reaching an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
English: The proposed PCWLAD method enhances multimodal optical image matching accuracy by combining structural similarity for coarse alignment and weighted least absolute deviation for fine-tuning, achieving sub-pixel precision across diverse datasets.
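The coarse stage is standard SSIM template matching (performed on phase-consistency maps in the paper). The brute-force sketch below, on plain arrays and with a hypothetical stride, shows the search; the template must be at least 7x7 for scikit-image's default SSIM window.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def coarse_match(search, template, stride=1):
    """Exhaustive SSIM template matching, a sketch of PCWLAD's coarse
    step. search: (H, W) float array; template: (h, w) float array.
    Returns the (row, col) of the best-matching window."""
    H, W = search.shape
    h, w = template.shape
    best, best_rc = -np.inf, (0, 0)
    rng = template.max() - template.min()
    for r in range(0, H - h + 1, stride):
        for c in range(0, W - w + 1, stride):
            win = search[r:r + h, c:c + w]
            s = ssim(win, template,
                     data_range=max(rng, win.ptp(), 1e-6))
            if s > best:
                best, best_rc = s, (r, c)
    return best_rc
```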

Authors:Xinan Zhang, Haolin Wang, Yung-An Hsieh, Zhongyu Yang, Anthony Yezzi, Yi-Chang Tsai
Title: Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets
Abstract:
Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset acquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new annotated dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection
English: This review analyzes key trends in deep learning-based crack detection, including evolving learning paradigms, enhanced generalizability, and diversified data acquisition, while introducing a new 3D dataset and benchmarking foundation models to guide future research.

Authors:Arianna Bunnell, Devon Cataldi, Yannik Glaser, Thomas K. Wolfgruber, Steven Heymsfield, Alan B. Zonderman, Thomas L. Kelly, Peter Sadowski, John A. Shepherd
Title: Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging
Abstract:
Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually annotated TBDXA scans. The method achieves a percentage of correct keypoints of 99.5% on an external testing dataset. To demonstrate its value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans across five different TBDXA imaging modes; associations with health markers are then tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers corroborate existing evidence and generate new hypotheses on the relationship of body composition and shape to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
English: A deep learning method was developed to automatically place fiducial points on total-body dual X-ray absorptiometry scans, achieving high accuracy and enabling shape and appearance modeling that reveals associations between body composition and various health markers.

Authors:Kaixin Peng, Mengyang Zhao, Haiyang Yu, Teng Fu, Bin Li
Title: Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs
Abstract:
As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model's zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released in https://github.com/PKXX1943/PD-OBS.
English Summary: This study introduces an interpretable decipherment method for Oracle Bone Script using Large Vision-Language Models, combining radical and pictographic analysis to enhance zero-shot performance and provide archaeologically valuable insights.

Authors:David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski
Title: Story2Board: A Training-Free Approach for Expressive Storyboard Generation
Abstract:
We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.
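Reciprocal Attention Value Mixing can be sketched as blending value vectors only between token pairs whose attention is strong in both directions. The threshold and blend weight below are hypothetical, not the paper's settings.

```python
import torch

def reciprocal_value_mix(attn, values, alpha=0.3, tau=0.02):
    """Sketch of Reciprocal Attention Value Mixing.
    attn:   (N, N) attention between token pairs across two panels.
    values: (N, D) value vectors.
    Pairs whose attention is strong in *both* directions get their
    values softly blended."""
    recip = attn * attn.T                           # strong only if mutual
    w = torch.where(recip > tau, recip, torch.zeros_like(recip))
    w = w / (w.sum(dim=1, keepdim=True) + 1e-8)     # normalize per token
    return (1 - alpha) * values + alpha * (w @ values)
```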

Authors:Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang
Title: LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
Abstract:
Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.
English: Large Vision-Language Models face computational inefficiency due to long visual tokens and large parameters, prompting the development of LLMC+, a comprehensive compression benchmark that combines token and model-level techniques to achieve high compression with minimal performance loss.
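As an example of the token-level compression family the benchmark covers, here is a generic attention-based visual-token pruning baseline; this is our illustration, not one of the toolkit's specific algorithms.

```python
import torch

def prune_visual_tokens(tokens, cls_attn, keep_ratio=0.25):
    """Token-level compression sketch: keep the visual tokens that a
    summary ([CLS]/text) token attends to most, dropping spatial
    redundancy. tokens: (B, N, D); cls_attn: (B, N)."""
    B, N, D = tokens.shape
    k = max(1, int(keep_ratio * N))
    idx = cls_attn.topk(k, dim=1).indices                     # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))
```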

Authors:Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding
Title: A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation
Abstract:
3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.
English: 3D Gaussian Splatting has emerged as a real-time, high-fidelity alternative to NeRF, enabling diverse applications like segmentation and editing through its explicit representation, with this survey systematically categorizing methods and maintaining updated resources.

Authors:Geonhee Sim, Gyeongsik Moon
Title: PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
Abstract:
Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.
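Balanced sampling amounts to oversampling the single authentic input image relative to the diffusion-generated frames when assembling training batches, so that identity drift in the generated videos does not dominate optimization. A toy sketch with a hypothetical mixing rate:

```python
import random

def balanced_batch(input_views, generated_views, p_input=0.5, batch_size=8):
    """Balanced-sampling sketch: draw each batch element from the real
    input image's crops with probability p_input, otherwise from the
    diffusion-generated frames. p_input is a hypothetical value."""
    batch = []
    for _ in range(batch_size):
        pool = input_views if random.random() < p_input else generated_views
        batch.append(random.choice(pool))
    return batch
```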

Authors:Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, Zeynep Akata
Title: Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
Abstract:
The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost. Code is available at https://github.com/ExplainableML/HyperNoise
English summary: The proposed Noise Hypernetwork technique integrates test-time scaling benefits into diffusion models during post-training, achieving significant quality improvements with minimal computational overhead by modulating initial noise through a theoretically grounded framework.
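A minimal sketch of the idea: a small residual network modulates the initial noise fed to a frozen distilled generator, trained to raise a reward while a penalty keeps the modulated noise close to the base distribution. The architecture and regularizer weight are hypothetical, and `generator` and `reward` are assumed callables, not the paper's API.

```python
import torch
import torch.nn as nn

class NoiseHypernet(nn.Module):
    """Tiny noise hypernetwork: maps initial latent noise to a
    modulated noise for a frozen one/few-step generator."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, eps):
        return eps + self.net(eps)      # residual modulation of noise

def hypernoise_loss(hnet, generator, reward, eps, beta=0.1):
    """Reward-tilted objective sketch: maximize the reward of images
    generated from modulated noise while keeping the modulation small
    so results stay close to the base model's distribution."""
    eps_mod = hnet(eps)
    img = generator(eps_mod)            # frozen distilled generator
    return -reward(img).mean() + beta * (eps_mod - eps).pow(2).mean()
```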

Authors:Tianqi Xiang, Yi Li, Qixiang Zhang, Xiaomeng Li
Title: MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification
Abstract:
Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through few-shot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior arts on multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at https://github.com/xmed-lab/MOC.
English: The proposed Meta-Optimized Classifier (MOC) enhances few-shot WSI classification by automatically selecting optimal classifier configurations from a diverse bank, achieving significant performance improvements over existing methods, particularly in data-scarce clinical scenarios.
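The classifier-bank idea can be sketched as a few heterogeneous heads whose logits are blended by meta-learned mixture weights; in few-shot training only the small heads and the weights need fitting. The candidate designs below are illustrative, not MOC's actual bank.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaOptimizedClassifier(nn.Module):
    """Sketch of a meta-optimized classifier: a bank of candidate heads
    (linear, cosine-prototype, MLP) blended by learned mixture weights."""
    def __init__(self, dim, n_cls):
        super().__init__()
        self.linear = nn.Linear(dim, n_cls)
        self.proto = nn.Parameter(torch.randn(n_cls, dim))  # cosine head
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_cls))
        self.mix = nn.Parameter(torch.zeros(3))             # meta-learned

    def forward(self, x):                                   # x: (B, dim)
        cos = F.normalize(x, dim=-1) @ F.normalize(self.proto, dim=-1).T
        logits = torch.stack([self.linear(x), 10.0 * cos, self.mlp(x)])
        return (self.mix.softmax(0).view(3, 1, 1) * logits).sum(0)
```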

Authors:Yaohui Wang, Di Yang, Xinyuan Chen, Francois Bremond, Yu Qiao, Antitza Dantcheva
Title: LIA-X: Interpretable Latent Portrait Animator
Abstract:
We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
English: LIA-X is a novel interpretable portrait animator that transfers facial dynamics from a driving video to a source portrait using a linear motion code navigation and a Sparse Motion Dictionary for fine-grained control, outperforming previous methods in reenactment tasks and enabling precise user-guided editing.
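The "edit-warp-render" control surface reduces to linear navigation in latent space: motion is a coefficient vector over a learned dictionary of directions, so a user can dial one interpretable factor at a time. A toy sketch with hypothetical dimensions and an invented coefficient index:

```python
import torch
import torch.nn as nn

class SparseMotionDictionary(nn.Module):
    """Sketch of LIA-X-style latent navigation: motion as a linear
    combination of dictionary directions, one per facial factor."""
    def __init__(self, latent_dim=512, n_dirs=20):
        super().__init__()
        self.dirs = nn.Parameter(torch.randn(n_dirs, latent_dim))

    def forward(self, z_src, alphas):
        # z_src: (B, latent_dim); alphas: (B, n_dirs) motion magnitudes
        return z_src + alphas @ self.dirs

# User-guided edit: tweak a single (hypothetical) coefficient, e.g. a
# direction controlling head pose, before warping and rendering.
dictionary = SparseMotionDictionary()
z = torch.randn(1, 512)
alphas = torch.zeros(1, 20)
alphas[0, 3] = 0.8                 # strengthen one motion factor
z_edit = dictionary(z, alphas)
```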

Authors:Benjamin Adjadj, Pierre-Antoine Bannier, Guillaume Horent, Sebastien Mandela, Aurore Lyon, Kathryn Schutte, Ulysse Marteau, Valentin Gaury, Laura Dumont, Thomas Mathieu, MOSAIC consortium, Reda Belbahri, Benoît Schmauch, Eric Durand, Katharina Von Loga, Lucie Gillet
Title: Towards Comprehensive Cellular Characterisation of H&E slides
Abstract:
Cell detection, segmentation and classification are essential for analyzing tumor microenvironments (TME) on hematoxylin and eosin (H&E) slides. Existing methods suffer from poor performance on understudied cell types (rare or not present in public datasets) and limited cross-domain generalization. To address these shortcomings, we introduce HistoPLUS, a state-of-the-art model for cell analysis, trained on a novel curated pan-cancer dataset of 108,722 nuclei covering 13 cell types. In external validation across 4 independent cohorts, HistoPLUS outperforms current state-of-the-art models in detection quality by 5.2% and overall F1 classification score by 23.7%, while using 5x fewer parameters. Notably, HistoPLUS unlocks the study of 7 understudied cell types and brings significant improvements on 8 of 13 cell types. Moreover, we show that HistoPLUS robustly transfers to two oncology indications unseen during training. To support broader TME biomarker research, we release the model weights and inference code at https://github.com/owkin/histoplus/.
English Summary: HistoPLUS is a state-of-the-art model that significantly improves cell detection and classification accuracy while using fewer parameters, enabling robust analysis of understudied cell types and demonstrating strong cross-domain generalization.

Authors:Xiaojiao Xiao, Jianfeng Zhao, Qinmin Vivian Hu, Guanghui Wang
Title: T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation, and Diagnosis
Abstract:
Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving lesion classification and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases; and a temporal classification consistency (TCC) constraint that aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability for the assessment of liver lesions. The implementation of T-CACE is publicly available at: https://github.com/xiaojiao929/T-CACE.
中文: T-CACE框架通过创新的时间建模和分类一致性机制,直接从非增强MRI合成多期相增强MRI,为肝脏病变诊断提供了更安全可靠的解决方案。
English: The T-CACE framework synthesizes multi-phase contrast-enhanced MRI from non-contrast MRI, improving diagnostic safety and accuracy for liver lesions through innovative temporal modeling and classification consistency.

Authors:Yachao Liang, Min Yu, Gang Li, Jianguo Jiang, Boquan Li, Feng Yu, Ning Zhang, Xiang Meng, Weiqing Huang
Title: SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
Abstract:
Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, provide precise information that effectively reflects facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of cross-dataset generalization and robustness, without using any fake videos in model training. Code is available at https://github.com/Eleven4AI/SpeechForensics.
中文: 本文提出了一种新颖的视听语音表征学习方法,通过利用音频与视觉的协同作用来检测人脸伪造视频,无需在训练中使用伪造视频即可显著提升跨数据集泛化能力和鲁棒性。
English: This paper introduces a novel audio-visual speech representation learning method for detecting face forgery videos, which enhances cross-dataset generalization and robustness by leveraging synchronized audio-visual cues without using fake videos during training.

Authors:Lingyu Chen, Yawen Zeng, Yue Wang, Peng Wan, Guo-chen Ning, Hongen Liao, Daoqiang Zhang, Fang Chen
Title: COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets
Abstract:
Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream tasks? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experience a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features by providing complementary features. This design enables robust generalization by leveraging cross-dataset experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME's superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: https://universalcome.github.io/UniversalCOME/.

Authors:Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
Title: Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Abstract:
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and push the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.
中文: 本综述系统梳理了克服传统Transformer计算局限的创新大语言模型架构,涵盖线性序列建模和稀疏注意力等技术,旨在提升模型效率与可扩展性。
English: This survey systematically reviews innovative Large Language Model architectures that overcome the computational limitations of traditional transformers, covering techniques like linear sequence modeling and sparse attention to enhance efficiency and scalability.

Authors:Shenxing Wei, Jinxi Li, Yafei Yang, Siyuan Zhou, Bo Yang
Title: RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
Abstract:
In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods, which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called the raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass across unseen datasets in testing.
Chinese: 本文提出RayletDF方法,通过光线元距离场直接从查询光线预测表面点,实现了从点云或3D高斯的高效三维表面重建,在多个数据集上展现出卓越性能和强大泛化能力。
English: This paper introduces RayletDF, a novel method for efficient 3D surface reconstruction from point clouds or 3D Gaussians that uses a raylet distance field to directly predict surface points, demonstrating superior performance and exceptional generalization across diverse datasets.

Authors:Xuhong Huang, Shiqi Liu, Kai Zhang, Ying Tai, Jian Yang, Hui Zeng, Lei Zhang
Title: Reverse Convolution and Its Applications to Image Restoration
Abstract:
Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution (a.k.a. deconvolution) does not serve as a true inverse of convolution due to inherent differences in their mathematical formulations. To date, no reverse convolution operator has been established as a standard component in neural architectures. In this paper, we propose a novel depthwise reverse convolution operator as an initial attempt to effectively reverse depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this operator, we further construct a reverse convolution block by combining it with layer normalization, 1$\times$1 convolution, and GELU activation, forming a Transformer-like structure. The proposed operator and block can directly replace conventional convolution and transposed convolution layers in existing architectures, leading to the development of ConverseNet. Corresponding to typical image restoration models such as DnCNN, SRResNet and USRNet, we train three variants of ConverseNet for Gaussian denoising, super-resolution and deblurring, respectively. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as a basic building module. We hope this work could pave the way for developing new operators in deep model design and applications.
中文: 本文提出了一种新颖的深度可分离逆卷积算子,通过正则化最小二乘优化有效逆转深度可分离卷积,构建的ConverseNet在图像去噪、超分辨率等修复任务中展现出卓越性能。
English: This paper introduces a novel depthwise reverse convolution operator that effectively reverses depthwise convolution through a regularized least-squares optimization, forming ConverseNet which demonstrates superior performance in image restoration tasks like denoising and super-resolution.
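
Code sketch (illustrative): the abstract does not spell out the solver, but the underlying idea, inverting a depthwise convolution by minimizing $\|k \ast x - y\|^2 + \lambda \|x\|^2$, has a closed form in the Fourier domain under a circular-boundary assumption: $X = \overline{K} \cdot Y / (|K|^2 + \lambda)$. The snippet below is that classical Wiener-style solution, shown only to illustrate the regularized least-squares formulation; it is not the paper's actual operator.

```python
import torch

def depthwise_reverse_conv(y: torch.Tensor, kernel: torch.Tensor, lam: float = 1e-2) -> torch.Tensor:
    """Approximately invert a depthwise circular convolution, channel by channel.

    Solves min_x ||k (*) x - y||^2 + lam * ||x||^2 in the Fourier domain,
    where (*) denotes circular convolution. This Wiener-style deconvolution
    illustrates the regularized least-squares idea only; it is not the
    operator proposed in the paper.

    y:      (B, C, H, W) output of a depthwise convolution
    kernel: (C, kh, kw)  one kernel per channel
    """
    B, C, H, W = y.shape
    kh, kw = kernel.shape[-2:]
    # Embed each kernel in an HxW canvas and roll its center to (0, 0),
    # matching the FFT's circular-convolution convention.
    k_full = torch.zeros(C, H, W, dtype=y.dtype, device=y.device)
    k_full[:, :kh, :kw] = kernel
    k_full = torch.roll(k_full, shifts=(-(kh // 2), -(kw // 2)), dims=(-2, -1))

    K = torch.fft.rfft2(k_full)   # (C, H, W//2+1)
    Y = torch.fft.rfft2(y)        # (B, C, H, W//2+1)
    X = torch.conj(K) * Y / (K.abs() ** 2 + lam)
    return torch.fft.irfft2(X, s=(H, W))
```

Wrapping such an operator with layer normalization, a 1$\times$1 convolution, and GELU activation, as the abstract describes, would give the Transformer-like reverse convolution block.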

Authors:Valentin Boussot, Jean-Louis Dillenseger
Title: KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging
Abstract:
KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at \href{https://github.com/vboussot/KonfAI}{https://github.com/vboussot/KonfAI}.
中文: KonfAI是一个专为医学影像设计的可配置深度学习框架,通过YAML配置文件实现可复现的工作流程,支持基于patch学习和多模型训练等高级功能,已在多项国际竞赛中取得领先成绩。
English: KonfAI is a highly configurable deep learning framework for medical imaging that enables reproducible workflows via YAML configurations and supports advanced strategies like patch-based learning and multi-model training, achieving top results in international challenges.

Authors:Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang
Title: Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
Abstract:
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100\% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining. The project page is here: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/
中文摘要:物理自回归模型(PAR)通过视频预训练理解机器人动力学而无需动作预训练,采用连续标记建模和优化推理机制,在操作任务中实现了最先进的性能表现。
English Summary: The Physical Autoregressive Model (PAR) leverages video pretraining to understand robotic dynamics without action-specific training, achieving state-of-the-art performance on manipulation tasks through continuous token modeling and optimized inference mechanisms.

Authors:Jinxi Li, Ziyang Song, Bo Yang
Title: TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
Abstract:
In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.
中文: 本文提出TRACE框架,通过将三维点视为具有物理属性的刚性粒子来学习运动规律,在动态场景预测中表现卓越,并能通过物理参数聚类实现对象分割。
English: This paper introduces TRACE, a novel framework that models 3D scene dynamics by treating each point as a rigid particle and learning its physical parameters, achieving superior performance in future frame prediction and enabling object segmentation through parameter clustering.
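
Code sketch (illustrative): the abstract does not specify how the per-particle physical parameters are laid out; as a rough picture, the hypothetical sketch below gives each particle a learnable velocity, acceleration, and angular rate, and rolls the state forward analytically.

```python
import torch
import torch.nn as nn

class RigidParticleDynamics(nn.Module):
    """Toy translation-rotation dynamics with one parameter set per particle.

    Illustrative only: TRACE's actual parameterization is defined in the
    paper. Here each particle carries a learnable linear velocity v,
    acceleration a, and angular rate omega (axis-angle per unit time).
    """

    def __init__(self, num_particles: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(num_particles, 3))      # linear velocity
        self.a = nn.Parameter(torch.zeros(num_particles, 3))      # linear acceleration
        self.omega = nn.Parameter(torch.zeros(num_particles, 3))  # angular rate

    def forward(self, x0: torch.Tensor, t: float):
        """x0: (N, 3) initial positions; returns positions and rotations at time t."""
        x_t = x0 + self.v * t + 0.5 * self.a * t ** 2   # x(t) = x0 + v t + a t^2 / 2
        theta_t = self.omega * t                        # accumulated axis-angle rotation
        return x_t, theta_t
```

Because every particle then carries its own (v, a, omega) triple, clustering those vectors (e.g., with k-means) gives the label-free object segmentation the abstract mentions.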

Authors:Nahyuk Lee, Juhong Min, Junhong Lee, Chunghyun Park, Minsu Cho
Title: Combinative Matching for Geometric Shape Assembly
Abstract:
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: 'identical surface shape' and 'opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet.

Authors:Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
Title: MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs, motivating the exploration of sparse Mixture of Experts (MoE) architectures. While MoE architectures improve parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate a Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency, and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
中文: 本文提出MoIIE架构,通过混合模态内与模态间专家机制,在大规模视觉语言模型中同时建模模态特定特征和跨模态关联,以更少的激活参数实现了优越性能。
English: This paper introduces MoIIE, a Mixture of Intra- and Inter-Modality Experts architecture for Large Vision-Language Models that efficiently models both modality-specific features and cross-modal interactions, achieving competitive performance with fewer activated parameters.
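
Code sketch (illustrative): the routing idea, in which each token is scored over its own modality's experts plus a shared inter-modality pool, can be written compactly; the expert counts, top-1 routing, and layer sizes below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MoIIELayer(nn.Module):
    """Modality-guided MoE routing sketch: intra-modality experts per modality
    plus a shared pool of inter-modality experts (hyperparameters assumed)."""

    def __init__(self, dim: int, n_intra: int = 2, n_inter: int = 2):
        super().__init__()
        make = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.text_experts = nn.ModuleList(make() for _ in range(n_intra))
        self.vision_experts = nn.ModuleList(make() for _ in range(n_intra))
        self.inter_experts = nn.ModuleList(make() for _ in range(n_inter))
        # One router per modality, scoring that modality's experts plus shared ones.
        self.text_router = nn.Linear(dim, n_intra + n_inter)
        self.vision_router = nn.Linear(dim, n_intra + n_inter)

    def _route(self, x, router, intra):
        experts = list(intra) + list(self.inter_experts)
        weights = router(x).softmax(dim=-1)   # (N, n_experts)
        idx = weights.argmax(dim=-1)          # top-1 routing (an assumption)
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            mask = idx == e
            if mask.any():
                out[mask] = weights[mask, e, None] * expert(x[mask])
        return out

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        """x: (N, dim) tokens; is_vision: (N,) boolean mask for image tokens."""
        out = torch.zeros_like(x)
        out[is_vision] = self._route(x[is_vision], self.vision_router, self.vision_experts)
        out[~is_vision] = self._route(x[~is_vision], self.text_router, self.text_experts)
        return out
```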

Authors:Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
Title: Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Abstract:
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent
中文: M3-Agent是一种具备长期记忆的多模态智能体框架,能通过实时感知构建知识并自主推理,在专业评测基准上显著优于现有最强基线模型。
English: M3-Agent is a multimodal framework with long-term memory that processes real-time sensory inputs to build knowledge and autonomously perform reasoning, outperforming top baselines on specialized benchmarks.
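
Code sketch (illustrative): as a schematic of the entity-centric memory, the toy store below keys multimodal records by entity embedding and retrieves by cosine similarity; M3-Agent's actual memory layout and its reinforcement-learned control loop are considerably richer.

```python
import torch
import torch.nn.functional as F

class EntityMemory:
    """Toy entity-centric memory: embeddings as keys, multimodal records as
    values, cosine-similarity retrieval. A schematic stand-in only."""

    def __init__(self):
        self.keys, self.records = [], []  # parallel lists

    def write(self, embedding: torch.Tensor, record: dict):
        self.keys.append(F.normalize(embedding, dim=0))
        self.records.append(record)

    def retrieve(self, query: torch.Tensor, k: int = 3):
        if not self.keys:
            return []
        sims = torch.stack(self.keys) @ F.normalize(query, dim=0)  # (N,)
        top = sims.topk(min(k, len(self.keys))).indices.tolist()
        return [self.records[i] for i in top]
```

In an agentic loop, the model would call retrieve on each reasoning turn, fold the returned records into its context, and decide whether to answer or to issue another query.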

Authors:Shekhnaz Idrissova, Islem Rekik
Title: Multimodal Sheaf-based Network for Glioblastoma Molecular Subtype Prediction
Abstract:
Glioblastoma is a highly invasive brain tumor with rapid progression rates. Recent studies have shown that glioblastoma molecular subtype classification serves as a significant biomarker for effective targeted therapy selection. However, this classification currently requires invasive tissue extraction for comprehensive histopathological analysis. Existing multimodal approaches combining MRI and histopathology images are limited and lack robust mechanisms for preserving shared structural information across modalities. In particular, graph-based models often fail to retain discriminative features within heterogeneous graphs, and structural reconstruction mechanisms for handling missing or incomplete modality data are largely underexplored. To address these limitations, we propose a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data. Our model outperforms baseline methods and demonstrates robustness in incomplete or missing data scenarios, contributing to the development of virtual biopsy tools for rapid diagnostics. Our source code is available at https://github.com/basiralab/MMSN/.
中文: 提出的基于层结构的框架有效融合了MRI和组织病理学数据,以改进胶质母细胞瘤亚型分类,其性能优于现有方法并在数据不完整时表现出稳健性,推动了快速诊断的虚拟活检工具发展。
English: The proposed sheaf-based framework effectively fuses MRI and histopathology data to enhance glioblastoma subtype classification, outperforming existing methods and showing robustness with incomplete data, advancing virtual biopsy tools for rapid diagnosis.

Authors:Devvrat Joshi, Islem Rekik
Title: NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation
Abstract:
The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus datasets for pneumonia detection, NEURAL achieves a 93.4-97.7\% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at https://github.com/basiralab/NEURAL.
Chinese: NEURAL框架通过语义引导的压缩技术将胸部X光转换为高度压缩的图表示,在肺炎检测中实现93.4-97.7%的数据缩减,同时保持0.88-0.95 AUC的高诊断性能。
English: NEURAL is a novel framework that uses semantics-guided compression to transform chest X-rays into highly compressed graph representations, achieving 93.4-97.7% data reduction while maintaining high diagnostic performance (0.88-0.95 AUC) for pneumonia detection.
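
Code sketch (illustrative): the pruning step can be pictured as keeping only the patches that receive cross-attention mass from the report and wiring the survivors into a graph. The keep ratio and the grid-neighbour edge rule below are assumptions; NEURAL's full construction, including the report knowledge graph it fuses in, is defined in the paper.

```python
import torch

def prune_patches_by_attention(patches: torch.Tensor,
                               attn_scores: torch.Tensor,
                               keep_ratio: float = 0.05):
    """Keep the most-attended patches and connect spatial neighbours.

    patches:     (P, D) patch embeddings on a sqrt(P) x sqrt(P) grid
    attn_scores: (P,)   cross-attention mass each patch received from the report
    """
    P = patches.shape[0]
    side = int(P ** 0.5)
    k = max(1, int(keep_ratio * P))
    keep = attn_scores.topk(k).indices            # diagnostically critical patches

    # Edges between kept patches that are grid neighbours (Chebyshev distance 1).
    rows, cols = keep // side, keep % side
    dr = (rows[:, None] - rows[None, :]).abs()
    dc = (cols[:, None] - cols[None, :]).abs()
    adj = torch.maximum(dr, dc) == 1
    edges = adj.nonzero(as_tuple=False).T         # (2, E) index pairs into `keep`

    return patches[keep], edges
```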

Authors:Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, Xiaodong Cun
Title: GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors
Abstract:
Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: https://github.com/GVCLab/GSFixer.
中文: GSFixer是一种新颖框架,通过基于参考视图的视频修复模型整合语义和几何特征,有效提升稀疏视图下3D高斯泼溅重建的质量,在伪影修复和三维重建方面优于现有方法。
English: GSFixer is a novel framework that enhances 3D Gaussian Splatting reconstructions from sparse views by integrating semantic and geometric features through a reference-guided video restoration model, outperforming existing methods in artifact restoration and 3D reconstruction.

Authors:Hao Xu, Arbind Agrahari Baniya, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazely, Sunil Aryal
Title: TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos
Abstract:
Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks like event detection and officiating. We present TOTNet, a Temporal Occlusion Tracking Network that leverages 3D convolutions, a visibility-weighted loss, and occlusion augmentation to improve performance under partial and full occlusions. Developed in collaboration with Paralympics Australia, TOTNet is designed for real-world sports analytics. We introduce TTA, a new occlusion-rich table tennis dataset collected from professional-level Paralympic matches, comprising 9,159 samples with 1,996 occlusion cases. Evaluated on four datasets across tennis, badminton, and table tennis, TOTNet significantly outperforms prior state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. These results demonstrate TOTNet's effectiveness for offline sports analytics in fast-paced scenarios. Code and data access: \href{https://github.com/AugustRushG/TOTNet}{AugustRushG/TOTNet}.
中文: TOTNet是一种时间遮挡追踪网络,通过与澳大利亚残奥委会合作开发,利用3D卷积和针对性训练技术显著提升了遮挡情况下的球体追踪性能,在多项运动数据集中实现了最优表现。
English: TOTNet, a Temporal Occlusion Tracking Network developed with Paralympics Australia, significantly enhances ball tracking under occlusion using 3D convolutions and specialized training techniques, achieving state-of-the-art performance across multiple sports datasets.
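
Code sketch (illustrative): the visibility-weighted loss can be read as up-weighting occluded frames so the network must infer the ball from temporal context rather than learn to ignore those frames; the 2x weight and the smooth-L1 base loss are assumptions, not TOTNet's published values.

```python
import torch
import torch.nn.functional as F

def visibility_weighted_loss(pred_xy: torch.Tensor,
                             gt_xy: torch.Tensor,
                             visibility: torch.Tensor,
                             occluded_weight: float = 2.0) -> torch.Tensor:
    """Weighted regression loss over a clip.

    pred_xy, gt_xy: (T, 2) per-frame ball positions
    visibility:     (T,)   1.0 where the ball is visible, 0.0 where occluded
    """
    per_frame = F.smooth_l1_loss(pred_xy, gt_xy, reduction="none").sum(-1)  # (T,)
    weights = torch.where(visibility > 0.5,
                          torch.ones_like(visibility),
                          torch.full_like(visibility, occluded_weight))
    return (weights * per_frame).sum() / weights.sum()
```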

Authors:Shengjun Zhu, Siyu Liu, Runqing Xiong, Liping Zheng, Duo Ma, Rongshang Chen, Jiaxin Cai
Title: Multi-Contrast Fusion Module: An attention mechanism integrating multi-contrast features for fetal torso plane classification
Abstract:
Purpose: Prenatal ultrasound is a key tool in evaluating fetal structural development and detecting abnormalities, contributing to reduced perinatal complications and improved neonatal survival. Accurate identification of standard fetal torso planes is essential for reliable assessment and personalized prenatal care. However, limitations such as low contrast and unclear texture details in ultrasound imaging pose significant challenges for fine-grained anatomical recognition. Methods: We propose a novel Multi-Contrast Fusion Module (MCFM) to enhance the model's ability to extract detailed information from ultrasound images. MCFM operates exclusively on the lower layers of the neural network, directly processing raw ultrasound data. By assigning attention weights to image representations under different contrast conditions, the module enhances feature modeling while explicitly maintaining minimal parameter overhead. Results: The proposed MCFM was evaluated on a curated dataset of fetal torso plane ultrasound images. Experimental results demonstrate that MCFM substantially improves recognition performance, with a minimal increase in model complexity. The integration of multi-contrast attention enables the model to better capture subtle anatomical structures, contributing to higher classification accuracy and clinical reliability. Conclusions: Our method provides an effective solution for improving fetal torso plane recognition in ultrasound imaging. By enhancing feature representation through multi-contrast fusion, the proposed approach supports clinicians in achieving more accurate and consistent diagnoses, demonstrating strong potential for clinical adoption in prenatal screening. The codes are available at https://github.com/sysll/MCFM.
中文: 本研究提出的多对比度融合模块(MCFM)通过优化特征提取,在几乎不增加模型复杂度的前提下显著提升了胎儿躯干平面超声图像的识别性能,有助于提高临床诊断的准确性和可靠性。
English: The study introduces a Multi-Contrast Fusion Module (MCFM) that enhances fetal torso plane recognition in ultrasound imaging by improving feature extraction with minimal added complexity, leading to higher diagnostic accuracy and clinical reliability.
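
Code sketch (illustrative): the fusion can be made concrete by faking "different contrast conditions" with gamma curves, embedding each variant with a shared shallow convolution, and blending the branches with learned attention weights; the gamma values, pooled attention, and layer sizes are assumptions rather than MCFM's published design.

```python
import torch
import torch.nn as nn

class MultiContrastFusion(nn.Module):
    """Attention-weighted fusion of contrast-adjusted views of one image."""

    def __init__(self, channels: int = 16, gammas=(0.5, 1.0, 2.0)):
        super().__init__()
        self.gammas = gammas
        self.embed = nn.Conv2d(1, channels, 3, padding=1)  # shared low-level embedding
        self.score = nn.Conv2d(channels, 1, 1)             # per-branch attention logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, 1, H, W) grayscale ultrasound scaled to [0, 1]."""
        B = x.shape[0]
        variants = [x.clamp(min=1e-6) ** g for g in self.gammas]       # contrast stand-ins
        feats = torch.stack([self.embed(v) for v in variants], dim=1)  # (B, G, C, H, W)
        logits = self.score(feats.flatten(0, 1)).mean(dim=(-2, -1))    # (B*G, 1), pooled
        weights = logits.view(B, len(self.gammas), 1, 1, 1).softmax(dim=1)
        return (weights * feats).sum(dim=1)                            # (B, C, H, W)
```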

Authors:Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
Title: Preacher: Paper-to-Video Agentic System
Abstract:
The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they suffer from limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/Gen-Verse/Paper2Video
中文: 本文提出首个论文转视频系统Preacher,通过自上而下的分解重构与自下而上的视频合成,结合渐进式思维链实现跨模态对齐,能够生成超越现有模型的高质量学术视频摘要。
English: The paper introduces Preacher, an agentic system that overcomes limitations of current video generation models by employing top-down decomposition and bottom-up synthesis with Progressive Chain of Thought planning to create high-quality video abstracts from research papers.

Authors:Heyi Sun, Cong Wang, Tian-Xing Xu, Jingwei Huang, Di Kang, Chunchao Guo, Song-Hai Zhang
Title: SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
Abstract:
Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, underpinning many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.

Authors:Jiwon Kim, Pureum Kim, SeonHwa Kim, Soobin Park, Eunju Cha, Kyong Hwan Jin
Title: Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
Abstract:
Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refines the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even between class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at https://github.com/jwonkm/DRF.
中文: 本文提出了一种无需训练的双重递归反馈(DRF)系统,通过递归优化潜在表示来增强可控文本到图像模型,从而更好地融合结构和外观属性,实现细粒度生成并提升空间准确性。
English: This paper introduces a training-free Dual Recursive Feedback (DRF) system that enhances controllable text-to-image models by recursively refining latent representations to better integrate structural and appearance attributes, enabling fine-grained generation and improved spatial accuracy.
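
Code sketch (illustrative): one way to picture the dual feedback is as a small gradient loop on an intermediate diffusion latent, pulling its appearance and structure features toward given targets. The feature extractors, step count, and step size below are placeholders; DRF's concrete update rule is defined in the paper.

```python
import torch
import torch.nn.functional as F

def dual_feedback_refine(latent, appearance_target, structure_target,
                         extract_app, extract_struct,
                         steps: int = 3, lr: float = 0.1):
    """Recursively nudge a latent toward appearance and structure targets.

    extract_app / extract_struct are hypothetical differentiable feature
    extractors; appearance_target / structure_target are their desired outputs.
    """
    for _ in range(steps):
        latent = latent.detach().requires_grad_(True)
        loss = (F.mse_loss(extract_app(latent), appearance_target)
                + F.mse_loss(extract_struct(latent), structure_target))
        (grad,) = torch.autograd.grad(loss, latent)
        latent = latent - lr * grad   # feedback step on the latent itself
    return latent.detach()
```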

Authors:Fengyi Wu, Yifei Dong, Zhi-Qi Cheng, Yilong Dai, Guangyu Chen, Hang Wang, Qi Dai, Alexander G. Hauptmann
Title: GoViG: Goal-Conditioned Visual Navigation Instruction Generation
Abstract:
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.
中文摘要:GoViG是一项仅通过初始与目标位置的原始视觉观测自主生成导航指令的新任务,它通过视觉预测与指令生成的双子任务框架,在跨领域环境中实现了卓越的适应性和评估指标提升。
English Summary: GoViG is a novel task that generates navigation instructions using only raw egocentric visual inputs from start to goal positions, employing visual forecasting and instruction generation within a multimodal model to achieve superior adaptability and performance metrics.

Authors:Moinak Bhattacharya, Gagandeep Singh, Shubham Jain, Prateek Prasanna
Title: GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs
Abstract:
In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist's eye gaze has distinct patterns that capture both fine-grained and coarser-level disease-related information. While interpreting an image, a radiologist's attention varies throughout the reading; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that, apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (a few of which constitute long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at https://github.com/lordmoinak1/gazelt.
中文: GazeLT通过整合与分解机制利用放射科医师视觉注意力的时间特征来改进长尾疾病分类,在NIH-CXR-LT和MIMIC-CXR-LT数据集上展现出优于现有方法的性能。
English: GazeLT introduces an integration-disintegration approach that leverages radiologists' temporal visual attention patterns to enhance long-tailed disease classification, demonstrating superior performance on NIH-CXR-LT and MIMIC-CXR-LT datasets.

Authors:Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Title: From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts
Abstract:
Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into the DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
中文: 针对视频生成中大幅面部角度下身份保持的难题,本研究提出了混合面部专家(MoFE)机制和专门的大角度面部(LFA)数据集,在面部相似度和语义对齐方面显著超越了现有最优方法。
English: To tackle identity preservation challenges in video generation with large facial angles, this study introduces a Mixture of Facial Experts (MoFE) mechanism and a specialized Large Face Angles (LFA) dataset, significantly improving face similarity and semantic alignment over prior methods.

Authors:Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia
Title: RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
Abstract:
Visual manipulation localization (VML) -- across both images and videos -- is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
Chinese: RelayFormer通过将输入分割为固定大小的子图像并引入全局-局部中继注意力机制,有效解决了视觉篡改定位中的分辨率多样性和模态差异问题,在多个基准测试中以高效方式实现了最先进的性能。
English: RelayFormer is a unified framework that addresses resolution diversity and modality gaps in visual manipulation localization by using fixed-size sub-images and a global-local relay attention mechanism, achieving state-of-the-art performance efficiently across various benchmarks.

Authors:Shuai Tan, Biao Gong, Zhuoxin Liu, Yan Wang, Xi Chen, Yifan Feng, Hengshuang Zhao
Title: Animate-X++: Universal Character Image Animation with Dynamic Backgrounds
Abstract:
Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Furthermore, previous methods could only generate videos with static backgrounds, which limits the realism of the videos. For the first challenge, our in-depth analysis attributes this limitation to insufficient modeling of motion, which fails to comprehend the movement pattern of the driving video and thus imposes a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X++, a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both an implicit and an explicit manner. The former leverages CLIP visual features of a driving video to extract the gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of DiT by simulating in advance possible inputs that may arise during inference. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks. Combined with the proposed partial parameter training, this approach achieves not only character animation but also text-driven background dynamics, making the videos more realistic. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench) to evaluate the performance of Animate-X++ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X++.
中文: Animate-X++提出了一种通用动画框架,通过姿态指示器增强运动建模以适用于多种角色类型,并采用多任务训练实现动态背景,经新基准和大量实验验证了其优越性。
English: Animate-X++ is a universal animation framework that overcomes limitations in existing methods by introducing a Pose Indicator for enhanced motion comprehension across various character types and a multi-task training strategy for dynamic backgrounds, validated through a new benchmark and extensive experiments.

Authors:Haoxiang Shi, Xiang Deng, Zaijing Li, Gongwei Chen, Yaowei Wang, Liqiang Nie
Title: DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation
Abstract:
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy's robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: https://github.com/Tokishx/DifNav.
中文:提出的DAgger扩散导航(DifNav)是一种端到端的连续环境视觉语言导航策略,它将路径点生成与规划统一为单一扩散模型,无需独立路径点预测器即可建模多模态动作,并通过DAgger训练增强鲁棒性和导航性能。
English: The proposed DAgger Diffusion Navigation (DifNav) is an end-to-end VLN-CE policy that unifies waypoint generation and planning into a single diffusion model, eliminating the need for a separate waypoint predictor while enabling multi-modal action modeling and improved robustness through DAgger training.
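
Code sketch (illustrative): the DAgger component follows the classic loop of Ross et al. (2011): roll out the learner, have the expert relabel the states the learner actually visits, aggregate, and retrain. The policy, expert, and environment interfaces below are hypothetical placeholders, not DifNav's actual APIs.

```python
def dagger_train(policy, expert, env, rounds: int = 5, episodes_per_round: int = 10):
    """Minimal DAgger loop; `policy.act/fit`, `expert.act`, and the env
    interface are stand-ins for illustration only."""
    dataset = []  # aggregated (state, expert_action) pairs
    for _ in range(rounds):
        for _ in range(episodes_per_round):
            state, done = env.reset(), False
            while not done:
                action = policy.act(state)                   # learner picks the trajectory...
                dataset.append((state, expert.act(state)))   # ...expert labels each visited state
                state, done = env.step(action)
        policy.fit(dataset)                                  # retrain on the aggregated data
    return policy
```

This is what lets the policy recover from its own compounding errors: the expert supervises exactly the states the learner drifts into, not only the expert's own trajectories.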

Authors:Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng
Title: Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation
Abstract:
The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
中文: GOAL框架采用基于流的生成方法,利用LLM增强的语义地图建模环境不确定性,在多个基准测试中实现了最先进的物体导航性能并展现出强大的泛化能力。
English: The GOAL framework introduces a generative flow-based approach that leverages LLM-enriched semantic maps to model environmental uncertainties, achieving state-of-the-art performance and strong generalization in ObjectNav tasks across multiple benchmarks.
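
Code sketch (illustrative): the prior-injection step can be pictured as stamping 2D Gaussian fields at LLM-suggested object locations onto the target semantic map. The max-blend and the sigma value below are assumptions.

```python
import torch

def inject_llm_priors(sem_map: torch.Tensor, priors, sigma: float = 4.0) -> torch.Tensor:
    """Encode LLM-suggested placements as 2D Gaussian fields on a semantic map.

    sem_map: (C, H, W) per-class semantic map
    priors:  iterable of (class_idx, row, col) placements suggested by an LLM
    """
    C, H, W = sem_map.shape
    ys = torch.arange(H, dtype=sem_map.dtype).view(H, 1)
    xs = torch.arange(W, dtype=sem_map.dtype).view(1, W)
    out = sem_map.clone()
    for cls, r, c in priors:
        field = torch.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
        out[cls] = torch.maximum(out[cls], field)  # keep the stronger belief per cell
    return out
```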

Authors:Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho
Title: Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving
Abstract:
Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo
中文: 本文推出Waymo-3DSkelMo首个大规模高质量3D骨骼运动数据集,通过LiDAR点云提取具有交互语义的时序连贯动作,解决了现有运动捕捉技术的局限性,为复杂城市场景中行人行为理解建立了新基准。
English: This paper introduces Waymo-3DSkelMo, the first large-scale dataset providing high-quality 3D skeletal motions with interaction semantics derived from LiDAR data, addressing limitations of existing motion capture methods and establishing benchmarks for pedestrian behavior understanding in urban environments.

Authors:El Mustapha Mansouri
Title: Autonomous AI Bird Feeder for Backyard Biodiversity Monitoring
Abstract:
This paper presents a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens. A motion-triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; cropped regions are then classified by an EfficientNet-B3 model fine-tuned on a 40-species Belgian subset derived from a larger Kaggle corpus. All processing runs on commodity hardware without a discrete GPU, preserving privacy and avoiding cloud fees. The physical feeder uses small entry ports (30 mm) to exclude pigeons and reduce nuisance triggers. Detector-guided cropping improves classification accuracy over raw-frame classification. The classifier attains high validation performance on the curated subset (about 99.5 percent) and delivers practical field accuracy (top-1 about 88 percent) on held-out species, demonstrating feasibility for citizen-science-grade biodiversity logging at home.
中文: 本研究提出了一种低成本、本地部署的比利时城市花园鸟类自主监测系统,通过运动触发摄像头和本地优化模型处理,在保护隐私和避免云费用的同时实现了高精度监测。
English: This study introduces a low-cost, on-premise system for autonomous bird monitoring in Belgian urban gardens, using motion-triggered cameras and local processing with optimized models to achieve high accuracy while ensuring privacy and avoiding cloud fees.
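
Code sketch (illustrative): the detect-then-classify pipeline is straightforward with off-the-shelf tools. Below, a stock COCO Faster R-CNN from Detectron2 stands in for the fine-tuned detector (COCO class 14 is "bird"), a timm EfficientNet-B3 with a 40-way head stands in for the species classifier, the checkpoint path is hypothetical, and preprocessing is deliberately crude.

```python
import cv2
import timm
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.DEVICE = "cpu"  # the paper runs without a discrete GPU
detector = DefaultPredictor(cfg)

# 40-way species head; weights would come from the fine-tuned checkpoint
# described in the paper (the path here is hypothetical).
classifier = timm.create_model("efficientnet_b3", pretrained=False, num_classes=40)
classifier.load_state_dict(torch.load("be_species40_effnet_b3.pt"))
classifier.eval()

def classify_birds(frame_bgr):
    """Detector-guided cropping: localize birds, then classify each crop."""
    inst = detector(frame_bgr)["instances"]
    birds = inst[inst.pred_classes == 14]  # COCO class 14 = bird
    labels = []
    for x0, y0, x1, y1 in birds.pred_boxes.tensor.round().int().tolist():
        crop = cv2.resize(frame_bgr[y0:y1, x0:x1], (300, 300))  # B3 input size
        t = torch.from_numpy(crop[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
        with torch.no_grad():  # normalization omitted for brevity
            labels.append(classifier(t.unsqueeze(0)).argmax(1).item())
    return labels
```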

Authors:Kang Ni, Minrui Zou, Yuxuan Li, Xiang Li, Kehua Guo, Ming-Ming Cheng, Yimian Dai
Title: DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection
Abstract:
One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted or deep learning-based, analyze or enhance objects' spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which takes a fundamentally different perspective: it deconstructs and modulates the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8\% improvement on the SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at https://github.com/GrokCV/GrokSAR.
中文: DenoDet V2 通过精心设计的注意力架构在变换域中解构和调制特征,利用幅相信息的互补性实现相互增强,在降低模型复杂度的同时显著提升了SAR目标检测性能。
English: DenoDet V2 introduces a novel approach to SAR object detection by modulating features in the transform domain using a band-wise mutual modulation mechanism, achieving state-of-the-art performance with reduced model complexity.
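
Code sketch (illustrative): phase-amplitude cross modulation can be pictured as splitting a feature map's 2D spectrum into amplitude and phase, letting each gate the other, and transforming back. The 1$\times$1 projections and gating functions below are assumptions; the paper's band-wise design is more elaborate.

```python
import torch
import torch.nn as nn

class PhaseAmplitudeCrossModulation(nn.Module):
    """Mutual gating of amplitude and phase spectra (simplified sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.amp_from_phase = nn.Conv2d(channels, channels, 1)
        self.phase_from_amp = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft2(x)              # (B, C, H, W//2+1), complex
        amp, phase = spec.abs(), spec.angle()
        amp_mod = amp * torch.sigmoid(self.amp_from_phase(phase))  # phase gates amplitude
        phase_mod = phase + torch.tanh(self.phase_from_amp(amp))   # amplitude nudges phase
        return torch.fft.irfft2(torch.polar(amp_mod, phase_mod), s=x.shape[-2:])
```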

Authors:Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh
Title: What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation?
Abstract:
Medical image segmentation exhibits intra- and inter-annotator variability due to ambiguous object boundaries, annotator preferences, expertise, and tools, among other factors. Lesions with ambiguous boundaries, e.g., spiculated or infiltrative nodules, or irregular borders per the ABCD rule, are particularly prone to disagreement and are often associated with malignancy. In this work, we curate IMA++, the largest multi-annotator skin lesion segmentation dataset, on which we conduct an in-depth study of variability due to annotator, malignancy, tool, and skill factors. We find a statistically significant (p<0.001) association between inter-annotator agreement (IAA), measured using Dice, and the malignancy of skin lesions. We further show that IAA can be accurately predicted directly from dermoscopic images, achieving a mean absolute error of 0.108. Finally, we leverage this association by utilizing IAA as a "soft" clinical feature within a multi-task learning objective, yielding a 4.2% improvement in balanced accuracy averaged across multiple model architectures and across IMA++ and four public dermoscopic datasets. The code is available at https://github.com/sfu-mial/skin-IAV.
Chinese: 本研究推出了最大的多标注者皮肤病变分割数据集IMA++,揭示了标注者间一致性与病变恶性程度之间的显著关联,并证明将该一致性作为临床特征可有效提升多个数据集的诊断准确性。
English: This study introduces IMA++, the largest multi-annotator skin lesion segmentation dataset, revealing a significant link between inter-annotator agreement and lesion malignancy and demonstrating that leveraging this agreement as a clinical feature improves diagnostic accuracy across multiple datasets.
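
Code sketch (illustrative): using IAA as a "soft" clinical feature amounts to a second regression head trained jointly with the malignancy classifier; the backbone choice, L1 regression loss, and loss weight below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MalignancyWithIAA(nn.Module):
    """Multi-task head: malignancy classification plus IAA (Dice) regression."""

    def __init__(self, backbone: nn.Module, feat_dim: int, iaa_weight: float = 0.2):
        super().__init__()
        self.backbone = backbone                 # any feature extractor -> (B, feat_dim)
        self.cls_head = nn.Linear(feat_dim, 2)   # benign vs. malignant
        self.iaa_head = nn.Linear(feat_dim, 1)   # predicted agreement in [0, 1]
        self.iaa_weight = iaa_weight

    def forward(self, images, labels=None, iaa=None):
        feats = self.backbone(images)
        logits = self.cls_head(feats)
        iaa_pred = self.iaa_head(feats).sigmoid().squeeze(-1)
        if labels is None:
            return logits, iaa_pred
        # Joint objective: diagnosis plus the "soft" agreement signal.
        return (F.cross_entropy(logits, labels)
                + self.iaa_weight * F.l1_loss(iaa_pred, iaa))
```

The L1 regression target mirrors the paper's report of predicting IAA directly from dermoscopic images with a mean absolute error of 0.108.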

Authors:Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Fakhri Karray
Title: A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition
Abstract:
Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, enabling the model to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state of the art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the competitiveness of these models. The findings validate our key hypothesis: that developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
中文: 本研究提出一种双架构框架用于连续手语识别,通过手语者无关Conformer解决手语者独立性问题,并采用多尺度融合Transformer处理未知句式任务,在Isharah-1000数据集上取得最优性能,验证了任务专用网络设计的有效性。
English: This study introduces a dual-architecture framework for Continuous Sign Language Recognition, employing a Signer-Invariant Conformer for signer-independent challenges and a Multi-Scale Fusion Transformer for unseen-sentence tasks, achieving state-of-the-art performance on the Isharah-1000 dataset and validating task-specific network designs.
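
As a reminder of the metric quoted above, WER is word-level edit distance normalized by reference length; a minimal illustrative sketch:

def word_error_rate(ref, hyp):
    # Levenshtein distance over words: substitutions + insertions + deletions,
    # normalized by the reference length.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)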

Authors:Md. Milon Islam, Md Rezwanul Haque, S M Taslim Uddin Raju, Fakhri Karray
Title: FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition
Abstract:
Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range-Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model's robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition.
Chinese: FusionEnsemble-Net提出了一种基于注意力的时空网络集成方法,动态融合视觉与运动数据,在意大利手语识别中以99.44%的准确率超越了现有最优方法。
English: FusionEnsemble-Net introduces an attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data, achieving 99.44% accuracy in Italian Sign Language recognition and outperforming existing methods.
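
A plausible PyTorch sketch of an attention-based fusion module of this kind (a gating formulation we assume for illustration; the paper's exact design may differ):

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    # Gate RGB and range-Doppler features with learned per-feature weights.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb, radar):
        a = self.gate(torch.cat([rgb, radar], dim=-1))  # weights in [0, 1]
        return a * rgb + (1.0 - a) * radar

Four such fused channels, one per spatiotemporal backbone, would then feed the ensemble classification head.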

Authors:Yifan Jiang, Ahmad Shariftabrizi, Venkata SK. Manem
Title: Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model
Abstract:
Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8$\times$ fewer FLOPs (floating-point operations), 6.8$\times$ lower GPU memory consumption, and 14$\times$ faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.
中文:提出的Lung-DDPM+模型在生成用于诊断的合成肺部CT图像时,显著提升了效率和解剖精度,在保持高质量样本的同时实现了计算性能的大幅提升。
English: The proposed Lung-DDPM+ model significantly enhances efficiency and anatomical precision in generating synthetic lung CT images for diagnostic applications, achieving major improvements in computational performance while maintaining high sample quality.

Authors:Dongwoo Kang, Akhil Perincherry, Zachary Coalson, Aiden Gabriel, Stefan Lee, Sanghyun Hong
Title: Harnessing Input-Adaptive Inference for Efficient VLN
Abstract:
An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2$\times$ reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at https://github.com/secure-ai-systems-group/adaptive-vision-and-language-navigation.
中文: 本文提出了一种输入自适应的导航方法,通过空间、模型内和时间三个层面的优化,显著提升了视觉与语言导航模型的效率,在多个基准测试中实现了计算量减少两倍以上的效果。
English: This paper introduces an input-adaptive navigation method that enhances the efficiency of vision-and-language navigation models through spatial, intra-model, and temporal optimizations, achieving over a twofold reduction in computations across multiple benchmarks.
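
Of the three levels, the temporal caching idea (3) is the simplest to picture; a hypothetical sketch (encoder and key names are ours):

class ViewFeatureCache:
    # Reuse features for panoramic views the agent has already encoded.
    def __init__(self, encoder):
        self.encoder = encoder
        self.store = {}

    def get(self, view_id, pixels):
        if view_id not in self.store:
            self.store[view_id] = self.encoder(pixels)  # encode only once
        return self.store[view_id]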

Authors:Jeffri Murrugarra-LLerena, Haoran Niu, K. Suzanne Barber, Hal Daumé, Yang Trista Cao, Paola Cascante-Bonilla
Title: Beyond Blanket Masking: Examining Granularity for Privacy Protection in Images Captured by Blind and Low Vision Users
Abstract:
As visual assistant systems powered by visual language models (VLMs) become more prevalent, concerns over user privacy have grown, particularly for blind and low vision users who may unknowingly capture personal private information in their images. Existing privacy protection methods rely on coarse-grained segmentation, which uniformly masks entire private objects, often at the cost of usability. In this work, we propose FiG-Priv, a fine-grained privacy protection framework that selectively masks only high-risk private information while preserving low-risk information. Our approach integrates fine-grained segmentation with a data-driven risk scoring mechanism. We evaluate our framework using the BIV-Priv-Seg dataset and show that FiG-Priv preserves +26% of image content, enhancing the ability of VLMs to provide useful responses by 11% and identify the image content by 45%, while ensuring privacy protection. Project Page: https://artcs1.github.io/VLMPrivacy/

Authors:Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen
Title: FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents
Abstract:
With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments that quantify the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability. All resources are fully open-source. GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench
中文:FineState-Bench推出了首个细粒度GUI代理操作评估框架,发现当前模型仅实现32.8%的交互精度,并确认视觉定位能力是主要性能瓶颈。
English: FineState-Bench introduces the first evaluation framework for fine-grained GUI agent operations, revealing that current models achieve only 32.8% interaction accuracy and identifying visual positioning as the primary performance bottleneck.

Authors:Yoni Schirris, Eric Marcus, Jonas Teuwen, Hugo Horlings, Efstratios Gavves
Title: From Explainable to Explained AI: Ideas for Falsifying and Quantifying Explanations
Abstract:
Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights whether a model depends on spurious features that undermine generalization and harm a subset of patients or, conversely, may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation's predictiveness using general-purpose vision-language models. The results demonstrate that this allows us to qualitatively test claims of explanations and can quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at https://github.com/nki-ai/x2x.
中文摘要:本研究提出了一种人机-VLM交互系统,用于解释计算病理学中的深度学习分类器,通过定性测试和定量比较解释,推动从可解释AI向已解释AI的演进。
English Summary: This study introduces a human-machine-VLM interaction system for explaining deep learning classifiers in computational pathology, enabling qualitative testing and quantitative comparison of explanations to advance from explainable to explained AI.

Authors:Asim Ukaye, Numan Saeed, Karthik Nandakumar
Title: FIVA: Federated Inverse Variance Averaging for Universal CT Segmentation with Uncertainty Estimation
Abstract:
Different CT segmentation datasets are typically obtained from different scanners under different capture settings and often provide segmentation labels for a limited and often disjoint set of organs. Using these heterogeneous data effectively while preserving patient privacy can be challenging. This work presents a novel federated learning approach to achieve universal segmentation across diverse abdominal CT datasets by utilizing model uncertainty for aggregation and predictive uncertainty for inference. Our approach leverages the inherent noise in stochastic mini-batch gradient descent to estimate a distribution over the model weights, providing on-the-go uncertainty over the model parameters at the client level. The parameters are then aggregated at the server using this additional uncertainty information via a Bayesian-inspired inverse-variance aggregation scheme. Furthermore, the proposed method quantifies prediction uncertainty by propagating the uncertainty from the model weights, providing confidence measures essential for clinical decision-making. In line with recent work, predictive uncertainty is utilized in the inference stage to improve predictive performance. Experimental evaluations demonstrate the effectiveness of this approach in improving both the quality of federated aggregation and uncertainty-weighted inference compared to previously established baselines. The code for this work is made available at: https://github.com/asimukaye/fiva
中文: 本研究提出一种新颖的联邦学习方法,利用模型不确定性和预测不确定性来提升跨异构腹部CT数据集的通用分割效果,在保护患者隐私的同时显著改善了聚合质量和推理性能。
English: This study introduces a novel federated learning method that employs model and predictive uncertainty to enhance universal segmentation across heterogeneous abdominal CT datasets, improving both aggregation quality and inference performance while ensuring patient privacy.
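
The aggregation rule is classical inverse-variance weighting; a minimal sketch for a single parameter tensor, assuming each client reports a mean and a variance:

def inverse_variance_aggregate(means, variances, eps=1e-8):
    # Bayesian-inspired fusion: clients with less-certain weights count less.
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    fused_mean = sum(w * m for w, m in zip(weights, means)) / total
    fused_var = 1.0 / total  # uncertainty of the aggregated parameter
    return fused_mean, fused_var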

Authors:Maria Boyko, Aleksandra Beliaeva, Dmitriy Kornilov, Alexander Bernstein, Maxim Sharaev
Title: impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction
Abstract:
The use of diverse modalities, such as omics, medical images, and clinical data, can not only improve the performance of prognostic models but also deepen an understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex, often incomplete, and contain missing modalities, making their effective handling crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data. By addressing missing data during pre-training and enabling efficient resource utilization, impuTMAE surpasses prior multimodal approaches, achieving state-of-the-art performance in glioma patient survival prediction. Our code is available at https://github.com/maryjis/mtcp
中文摘要:impuTMAE模型通过基于Transformer的架构在预训练中学习多模态交互并填补缺失数据,在胶质瘤生存预测中实现了最优性能。
English Summary: The impuTMAE model introduces a transformer-based approach that handles missing medical data by learning multimodal interactions during pre-training, achieving state-of-the-art performance in glioma survival prediction.

Authors:Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang
Title: IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection
Abstract:
Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs; the largest improvement is on the DAGM dataset, where average accuracy is 43.3% higher than the 0.5B baseline. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.
中文摘要:提出的IAD-R1框架通过两阶段训练策略显著提升了视觉语言模型在工业异常检测中的能力,在零样本设置下性能超越包括GPT-4.1在内的商业模型。
English Summary: The proposed IAD-R1 framework significantly enhances industrial anomaly detection in Vision-Language Models through a two-stage training approach, achieving superior performance over commercial models in zero-shot settings.

Authors:Xingle Xu, Yongkang Liu, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Title: MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis
Abstract:
Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches typically treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress redundant and noisy information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework.
Chinese: 摘要提出了MoLAN框架,通过将多模态特征分块并动态分配去噪强度,精细消除噪声同时保留关键信息,其扩展方法MoLAN+在多个模型和数据集上实现了最优性能。
English: The abstract introduces MoLAN, a unified framework that dynamically edits noise in multimodal sentiment analysis by dividing each modality into blocks and applying tailored denoising strengths, with MoLAN+ achieving state-of-the-art results across multiple models and datasets.
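
A rough PyTorch sketch of modality-aware blocking with a per-block denoising strength (the module names and the subtractive correction are our assumptions):

import torch
import torch.nn as nn

class BlockwiseDenoise(nn.Module):
    def __init__(self, dim, num_blocks):
        super().__init__()
        self.num_blocks = num_blocks
        self.strength = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.correction = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim) features of one modality
        out = []
        for block in x.chunk(self.num_blocks, dim=1):
            s = self.strength(block.mean(dim=1, keepdim=True))  # estimated noise level
            out.append(block - s * self.correction(block))      # scaled suppression
        return torch.cat(out, dim=1)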

Authors:Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, Xinggang Wang
Title: Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
Abstract:
There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
中文: 该研究提出了Turbo-VAED,一种面向移动设备的VAE解码器,通过3D深度可分离卷积和解耦像素重组技术,大幅减少参数和延迟,首次在移动端实现720p视频实时解码且性能损失极小。
English: The study introduces Turbo-VAED, a mobile-optimized VAE decoder that reduces parameters and latency through 3D depthwise convolutions and a decoupled pixel shuffle, enabling real-time 720p video decoding on mobile devices with minimal performance loss.
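
The two building blocks named above are straightforward to sketch in PyTorch; the following is an illustrative rendering of the ideas, not the authors' decoder:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv3d(nn.Module):
    # Depthwise 3D conv (one filter per channel) + 1x1x1 pointwise mixing.
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.dw = nn.Conv3d(cin, cin, k, padding=k // 2, groups=cin)
        self.pw = nn.Conv3d(cin, cout, 1)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.pw(self.dw(x))

def spatial_pixel_shuffle_3d(x, r):
    # Decoupled shuffle: upsample H and W by r while leaving T untouched.
    b, c, t, h, w = x.shape
    x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)  # fold time into batch
    x = F.pixel_shuffle(x, r)                             # (B*T, C/r^2, H*r, W*r)
    return x.reshape(b, t, c // r**2, h * r, w * r).permute(0, 2, 1, 3, 4)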

Authors:Maxim A. Patratskiy, Alexey K. Kovalev, Aleksandr I. Panov
Title: Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
Abstract:
Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at https://ampiromax.github.io/ST-VLA.

Authors:Kaiwen Huang, Tao Zhou, Huazhu Fu, Yizhe Zhang, Yi Zhou, Xiao-Jun Wu
Title: Uncertainty-aware Cross-training for Semi-supervised Medical Image Segmentation
Abstract:
Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on medical image segmentation tasks spanning different modalities, such as MRI, CT, ultrasound, and colonoscopy. The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code will be released at https://github.com/taozh2017/UCSeg.
中文: 本文提出UC-Seg框架,通过双网络协同训练减少认知偏差并生成高置信度伪标签,在多种医学影像分割任务中实现了优于现有方法的准确性和泛化性能。
English: This paper introduces UC-Seg, an uncertainty-aware cross-training framework for semi-supervised medical image segmentation that mitigates model biases through dual-subnet collaboration and generates high-confidence pseudo-labels, achieving superior accuracy across multiple imaging modalities.
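
Under simplifying assumptions (entropy as the uncertainty proxy, a fixed threshold), the UPG idea can be sketched as:

import torch

def uncertainty_pseudo_labels(p1, p2, max_entropy=0.2):
    # p1, p2: (B, C, H, W) softmax maps from the two subnets.
    p = 0.5 * (p1 + p2)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1)  # per-pixel uncertainty
    labels = p.argmax(dim=1)
    labels[entropy > max_entropy] = -1  # mark uncertain pixels as ignore_index
    return labels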

Authors:Yuhao Wang, Wei Xi
Title: UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale
Abstract:
Convolutional neural networks (ConvNets) with large effective receptive fields (ERF), still in their early stages, have demonstrated promising effectiveness, but are constrained by high parameter and FLOPs costs and by a disrupted asymptotically Gaussian distribution (AGD) of the ERF. This paper proposes an alternative paradigm: rather than merely employing an extremely large ERF, it is more effective and efficient to expand the ERF while maintaining its AGD through a proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$, and $11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stack of proposed modules while maintaining the AGD of the ERF. Using these designs, we propose a universal model for ConvNets of any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$ FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring $88.4\%$ top-1 accuracy on ImageNet. Code and models are publicly available at https://github.com/ai-paperwithcode/UniConvNet.
中文: 本文提出的UniConvNet通过巧妙组合较小卷积核来扩展有效感受野并保持其渐近高斯分布,在多种视觉任务中以高效计算实现了超越现有最优模型的性能表现。
English: This paper introduces UniConvNet, a novel convolutional neural network that efficiently expands the effective receptive field while preserving its asymptotically Gaussian distribution through strategic combinations of smaller kernels, achieving state-of-the-art performance across multiple vision tasks with competitive computational efficiency.
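
To make the kernel-combination argument concrete: stacking 7x7, 9x9, and 11x11 convolutions gives a composite receptive field of 7+9+11-2 = 25, while repeated convolution keeps the composite kernel close to Gaussian (a central-limit-style effect). A hypothetical depthwise sketch:

import torch.nn as nn

class ReceptiveFieldAggregator(nn.Module):
    # Three stacked depthwise convs: 25x25 composite receptive field.
    def __init__(self, dim):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim, 7, padding=3, groups=dim),
            nn.Conv2d(dim, dim, 9, padding=4, groups=dim),
            nn.Conv2d(dim, dim, 11, padding=5, groups=dim),
        )

    def forward(self, x):
        return x + self.convs(x)  # residual connection for stable training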

Authors:Elman Ghazaei, Erchan Aptoula
Title: Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
Abstract:
The Earth's surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in an unified manner to extract domain-invariant features across domains. Input-dependent parameters existing in TCSSM are dynamically predicted by using both bi-temporal images and geo-disaster-related description, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.
中文: 本文针对变化检测视觉问答中的领域偏移问题,提出了文本条件状态空间模型和BrightVQA数据集,通过动态对齐双时相图像与文本描述实现了优越性能。
English: This paper introduces a novel Text-Conditioned State Space Model (TCSSM) and BrightVQA dataset to address domain shift in Change Detection Visual Question Answering, achieving superior performance by dynamically aligning bi-temporal imagery with textual descriptions.

Authors:Bin Ren, Xiaoshui Huang, Mengyuan Liu, Hong Liu, Fabio Poiesi, Nicu Sebe, Guofeng Mei
Title: Masked Clustering Prediction for Unsupervised Point Cloud Pre-training
Abstract:
Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction, and instance-level contrastive learning. MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at:https://github.com/Amazingren/maskclu.
中文: MaskClu是一种新颖的无监督预训练方法,通过结合掩码建模与聚类及对比学习,使视觉变换器能从三维点云中学习密集语义信息,并在多项三维任务中取得了领先性能。
English: MaskClu is a novel unsupervised pre-training method for vision transformers on 3D point clouds that combines masked modeling with clustering and contrastive learning to capture dense semantic information, achieving state-of-the-art results across multiple 3D understanding tasks.

Authors:Chaoyi Wang, Yifan Yang, Jun Pei, Lijie Xia, Jianpo Liu, Xiaobing Yuan, Xinhan Di
Title: Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos
Abstract:
Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.
中文: 本文提出WB-DH基准数据集,通过提供多模态标注和开源评估框架,解决了可动画全身虚拟形象评估中的现有不足。
English: This paper introduces the WB-DH benchmark dataset to address the limitations in evaluating animatable whole-body avatars by providing multi-modal annotations and an open-source evaluation framework.

Authors:Yuqi Peng, Lingtao Zheng, Yufeng Yang, Yi Huang, Mingfu Yan, Jianzhuang Liu, Shifeng Chen
Title: TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models
Abstract:
Personalized text-to-image generation aims to synthesize novel images of a specific subject or style using only a few reference images. Recent methods based on Low-Rank Adaptation (LoRA) enable efficient single-concept customization by injecting lightweight, concept-specific adapters into pre-trained diffusion models. However, combining multiple LoRA modules for multi-concept generation often leads to identity missing and visual feature leakage. In this work, we identify two key issues behind these failures: (1) token-wise interference among different LoRA modules, and (2) spatial misalignment between the attention map of a rare token and its corresponding concept-specific region. To address these issues, we propose Token-Aware LoRA (TARA), which introduces a token mask to explicitly constrain each module to focus on its associated rare token to avoid interference, and a training objective that encourages the spatial attention of a rare token to align with its concept region. Our method enables training-free multi-concept composition by directly injecting multiple independently trained TARA modules at inference time. Experimental results demonstrate that TARA enables efficient multi-concept inference and effectively preserves the visual identity of each concept by avoiding mutual interference between LoRA modules. The code and models are available at https://github.com/YuqiPeng77/TARA.
中文: TARA通过引入令牌掩码和空间对齐训练,有效防止多个LoRA模块间的相互干扰,实现无需训练的多概念图像生成并保持各概念的视觉特征。
English: TARA introduces token masking and spatial alignment training to prevent interference between LoRA modules, enabling training-free multi-concept image generation while preserving each concept's visual identity.
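
A PyTorch sketch of the token-mask mechanism, assuming one LoRA per concept and a binary mask over token positions (illustrative, not the released code):

import torch
import torch.nn as nn

class TokenAwareLoRA(nn.Module):
    def __init__(self, base, rank=4, scale=1.0):
        super().__init__()
        self.base = base  # frozen nn.Linear from the diffusion model
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # LoRA starts as a no-op
        self.scale = scale

    def forward(self, x, token_mask):  # x: (B, L, D); token_mask: (B, L) in {0, 1}
        delta = self.up(self.down(x)) * self.scale
        return self.base(x) + token_mask.unsqueeze(-1) * delta  # delta only at masked tokens

At inference, several independently trained modules can be injected side by side, each confined to its own rare token.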

Authors:Shi-Chen Zhang, Yunheng Li, Yu-Huan Wu, Qibin Hou, Ming-Ming Cheng
Title: Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
Abstract:
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. Through experimental analysis, we find that this paradigm imposes a highly challenging assumption for efficient scenarios: image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be adopted by existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.
Chinese: 该研究揭示了现有语义分割方法中类别表示与图像特征间的错位问题,提出一种耦合双分支偏移学习范式来动态优化二者,在多个数据集上以极少的参数增量实现了稳定的性能提升。
English: The study identifies a misalignment between class representations and image features in existing semantic segmentation methods and proposes a coupled dual-branch offset learning paradigm to dynamically refine both, achieving consistent performance improvements with minimal additional parameters across multiple datasets.
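
A rough sketch of the coupled dual-branch idea: learn offsets that shift pixel features and class representations toward each other before per-pixel classification (shapes and module names are our assumptions):

import torch
import torch.nn as nn

class OffsetHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_embed = nn.Parameter(torch.randn(num_classes, dim))
        self.feat_offset = nn.Conv2d(dim, dim, 1)  # per-pixel feature offset
        self.class_offset = nn.Linear(dim, dim)    # per-image class offset

    def forward(self, feats):                      # feats: (B, D, H, W)
        f = feats + self.feat_offset(feats)
        ctx = feats.mean(dim=(2, 3))               # (B, D) image context
        cls = self.class_embed + self.class_offset(ctx).unsqueeze(1)  # (B, K, D)
        return torch.einsum("bkd,bdhw->bkhw", cls, f)  # per-pixel class logits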

Authors:Zunjie Xiao, Xiao Wu, Tianhang Liu, Lingxi Hu, Yinling Zhang, Xiaoqing Zhang, Risa Higashita, Jiang Liu
Title: Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT
Abstract:
Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss that groups each lens structure sub-region into different confidence sub-regions via a confidence threshold, adopting a region-level perspective, aiming to exploit the expert annotation confidence prior. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweigh each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with a 6.13% IoU gain, a 4.33% DSC increase, and a 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.
Chinese: 本文提出了一种自适应置信度感知(ACW)损失函数,通过利用专家标注的置信度先验动态加权不同子区域并优化边界校准,在眼内透镜结构分割中显著提升了分割精度和边界校准效果。
English: This paper introduces an Adaptive Confidence-Wise (ACW) loss function that leverages expert annotation confidence levels to improve intraocular lens structure segmentation by dynamically weighting sub-regions and optimizing boundary calibration, achieving significant performance gains over traditional methods.
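
The region-weighted core of the loss is easy to sketch; for brevity this replaces the paper's adaptive threshold with a fixed one:

import torch
import torch.nn.functional as F

def acw_loss(logits, target, confidence, thresh=0.5, w_low=2.0, w_high=1.0):
    # confidence: per-pixel annotation-confidence map in [0, 1].
    ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    weights = torch.where(confidence < thresh,
                          torch.full_like(confidence, w_low),
                          torch.full_like(confidence, w_high))
    return (weights * ce).mean()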

Authors:Ouyang Xu, Baoming Zhang, Ruiyu Mao, Yunhui Guo
Title: SafeFix: Targeted Model Repair via Controlled Image Generation
Abstract:
Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images -- an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix
中文摘要:本文提出一种针对性模型修复方法,通过条件文本-图像生成器和大型视觉语言模型为代表性不足的故障案例生成语义一致的训练图像,在保持模型鲁棒性的同时显著减少了识别错误。
English Summary: This paper introduces a targeted model repair method that uses a conditional text-to-image generator and a large vision-language model to create semantically consistent training images for underrepresented failure cases, effectively reducing recognition errors while maintaining model robustness.

Authors:Qi Zheng, Li-Heng Chen, Chenlong He, Neil Berkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik, Yibo Fan, Zhengzhong Tu
Title: Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos
Abstract:
Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions were collected from a cohort of 45 human subjects. To demonstrate the value of this new resource, we tested and compared a variety of models that detect banding occurrences and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available at https://github.com/uniqzheng/CBAND.
Chinese: 尽管视频压缩技术有所进步,带状伪影仍然是影响视频质量的主要问题,为此我们创建了LIVE-YT-Banding数据集并开发了CBAND模型,该无参考评估模型在检测和评估带状伪影方面显著优于现有方法。
English: Despite advances in video compression, banding artifacts persist as a significant quality issue, leading to the creation of the LIVE-YT-Banding dataset and the development of CBAND, an efficient no-reference model that outperforms existing methods in detecting and assessing these artifacts.

Authors:Yimeng Geng, Mingyang Zhao, Fan Xu, Guanglin Cao, Gaofeng Meng, Hongbin Liu
Title: PADReg: Physics-Aware Deformable Registration Guided by Contact Force for Ultrasound Sequences
Abstract:
Ultrasound deformable registration estimates spatial transformations between pairs of deformed ultrasound images, which is crucial for capturing biomechanical properties and enhancing diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, ultrasound deformable registration remains highly challenging, especially under large deformation. The inherently low contrast, heavy noise, and ambiguous tissue boundaries in ultrasound images severely hinder reliable feature extraction and correspondence matching. Existing methods often suffer from poor anatomical alignment and lack physical interpretability. To address the problem, we propose PADReg, a physics-aware deformable registration framework guided by contact force. PADReg leverages synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain the registration. Specifically, instead of directly predicting deformation fields, we first construct a pixel-wise stiffness map utilizing the multi-modal information from contact force and ultrasound images. The stiffness map is then combined with force data to estimate a dense deformation field, through a lightweight physics-aware module inspired by Hooke's law. This design enables PADReg to achieve physically plausible registration with better anatomical alignment than previous methods relying solely on image similarity. Experiments on in-vivo datasets demonstrate that it attains an HD95 of 12.90, which is 21.34% better than state-of-the-art methods. The source code is available at https://github.com/evelynskip/PADReg.
中文: 提出的PADReg框架利用接触力作为物理先验,通过刚度映射和胡克定律原理改进了超声可变形配准,实现了更优的解剖对齐效果,其HD95指标比现有最优方法提升了21.34%。
English: The proposed PADReg framework uses contact force as a physical prior to enhance ultrasound deformable registration, achieving superior anatomical alignment and a 21.34% improvement in HD95 over existing methods by incorporating stiffness mapping and Hooke's law principles.
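
The physics prior is Hooke's law, F = kx, so displacement can be approximated as x = F/k; the sketch below applies this pixel-wise (a simplification of the paper's dense formulation):

import torch

def hookean_displacement(force, stiffness, eps=1e-6):
    # force: measured probe contact force (scalar or per-pixel map);
    # stiffness: per-pixel stiffness map estimated from force + image cues.
    return force / (stiffness + eps)  # stiffer tissue deforms less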

Authors:Jiahua Dong, Hui Yin, Wenqi Liang, Hanbin Zhao, Henghui Ding, Nicu Sebe, Salman Khan, Fahad Shahbaz Khan
Title: Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
Abstract:
Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.
中文: 提出的分层视觉提示学习模型通过结合帧级正交梯度校正和视频级上下文传播机制,有效解决了视频实例分割中的灾难性遗忘问题。
English: The proposed Hierarchical Visual Prompt Learning (HVPL) model effectively addresses catastrophic forgetting in video instance segmentation by employing frame-level and video-level prompts with orthogonal gradient correction and context propagation mechanisms.
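
The OGC step amounts to a gradient projection; a minimal sketch for a flattened prompt gradient, assuming an orthonormal basis for the old classes' feature space:

import torch

def orthogonal_gradient_correction(grad, old_basis):
    # grad: (d,) prompt gradient; old_basis: (k, d) orthonormal rows spanning
    # the old-class feature space. Remove the component lying in that span.
    coeffs = old_basis @ grad          # (k,) projections onto each basis vector
    return grad - old_basis.t() @ coeffs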

Authors:Honglei Xu, Zhilu Zhang, Junjie Fan, Xiaohe Wu, Wangmeng Zuo
Title: SelfHVD: Self-Supervised Handheld Video Deblurring for Mobile Phones
Abstract:
Shooting video with a handheld mobile phone, the most common photographic device, often produces blurry frames due to hand shake and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the model's ability, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct a synthetic and a real-world handheld video dataset for handheld video deblurring. Extensive experiments on these two and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://github.com/cshonglei/SelfHVD.
中文摘要:本文提出一种自监督手持视频去模糊方法,通过提取视频中的清晰线索并采用自增强训练策略,在真实手持视频数据集上显著优于现有方法。
English Summary: This paper introduces a self-supervised video deblurring method that leverages sharp video clues and proposes novel techniques to enhance model training and maintain spatial consistency, demonstrating superior performance on real-world handheld videos.

Authors:Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, Wei Yang, Wenkai Lv, Yangbin Yu, Yewen Wang, Yonghang Guan, Zhihao Hu, Zhongbin Fang, Zhongqian Sun
Title: Yan: Foundational Interactive Video Generation
Abstract:
We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), and then transform the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.

Authors:Wenwen Yu, Zhibo Yang, Yuliang Liu, Xiang Bai
Title: DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding. Code will be available at https://github.com/wenwenyu/DocThinker.
中文: DocThinker提出了一种基于规则的强化学习框架,通过动态优化推理策略来增强多模态文档理解的可解释性与适应性,同时有效缓解灾难性遗忘问题。
English: DocThinker introduces a rule-based reinforcement learning framework that dynamically refines reasoning strategies during inference, enhancing explainability and adaptability in multimodal document understanding while mitigating catastrophic forgetting.

Authors:Jingyun Liang, Jingkai Zhou, Shikai Li, Chenjie Cao, Lei Sun, Yichen Qian, Weihua Chen, Fan Wang
Title: RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
Abstract:
Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
中文: 本文提出了一种分解框架,通过分别控制前景主体、背景、轨迹和动作来生成逼真的人体视频,在控制性和视频质量方面均达到了最先进的性能。
English: This paper introduces a decomposed framework for generating realistic human videos by separately controlling foreground subjects, backgrounds, trajectories, and actions, achieving state-of-the-art performance in both controllability and video quality.
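
Trajectory control hinges on standard pinhole unprojection; a minimal sketch with calibrated focal lengths (fx, fy) and principal point (cx, cy):

import numpy as np

def unproject(u, v, depth, fx, fy, cx, cy):
    # Lift a 2D trajectory point (u, v) at a given depth into 3D camera space.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])

The ground-aware world transform, speed alignment, and orientation adjustment described above would then operate on the lifted points.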

Authors:Tuo Liu, Qinghan Yang, Yu Zhang, Rongjun Ge, Yang Chen, Guangquan Zhou
Title: Think as Cardiac Sonographers: Marrying SAM with Left Ventricular Indicators Measurements According to Clinical Guidelines
Abstract:
Left ventricular (LV) indicator measurements following clinical echocardiography guidelines are important for diagnosing cardiovascular disease. Although existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations due to the normally small training datasets. Therefore, it is necessary to introduce vision foundational models (VFM) with abundant knowledge. However, VFMs represented by the segment anything model (SAM) are usually suitable for segmentation but incapable of identifying key anatomical points, which are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME, combining the powerful visual understanding of SAM with segmentation and landmark localization tasks simultaneously. Consequently, the framework mimics the operation of cardiac sonographers, achieving LV indicator measurements consistent with clinical guidelines. We further present filtered cross-branch attention (FCBA) in AutoSAME, which leverages relatively comprehensive features in the segmentation to enhance the heatmap regression (HR) of key points from the frequency domain perspective, optimizing the visual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guided by spatial properties of LV, thereby improving the accuracy of dense predictions by prior spatial knowledge. The extensive experiments on an echocardiography dataset demonstrate the efficiency of each design and the superiority of our AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at https://github.com/QC-LIU-1997/AutoSAME.
中文摘要:本文提出AutoSAME框架,将SAM的视觉理解能力与分割和关键点定位任务相结合,通过创新的注意力机制和空间引导提示自动完成左心室指标测量,实验结果验证了其在各项任务中的优越性能。
English Summary: The paper introduces AutoSAME, a framework that integrates SAM's visual capabilities with segmentation and landmark localization to automate left ventricular indicator measurements in line with clinical guidelines, enhancing accuracy through novel attention mechanisms and spatial-guided prompts.

Authors:Wenhao Liang, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Title: Calibration Attention: Instance-wise Temperature Scaling for Vision Transformers
Abstract:
Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT's CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-
Chinese: 校准注意力(CalAttn)是一种创新模块,使视觉变换器能够直接从CLS标记中学习自适应、逐实例的温度缩放,在保持精度的同时以极小的参数开销将校准误差降低高达4倍。
English: Calibration Attention (CalAttn) is a novel module that enables Vision Transformers to learn adaptive, per-instance temperature scaling directly from the CLS token, achieving up to 4x reduction in calibration error with minimal parameter overhead while maintaining accuracy.
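The per-instance temperature idea is simple enough to sketch directly. The snippet below is a minimal, assumed PyTorch version (not the released CalAttn module): a small head maps the CLS token to a positive scalar that rescales that sample's logits.

```python
# Minimal sketch of per-instance temperature scaling from the CLS token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceTemperature(nn.Module):
    def __init__(self, embed_dim: int, hidden: int = 32):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, cls_token, logits):
        # softplus keeps the temperature positive; the small offset avoids division by zero
        t = F.softplus(self.head(cls_token)) + 1e-3   # (B, 1) per-sample temperature
        return logits / t                              # calibrated logits

cls_tok = torch.randn(4, 192)        # CLS embeddings from a (hypothetical) ViT
logits = torch.randn(4, 100)
probs = InstanceTemperature(192)(cls_tok, logits).softmax(dim=-1)  # calibrated probabilities
```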

Authors:Christophe EL Zeinaty, Wassim Hamidouche, Glenn Herrou, Daniel Menard
Title: Designing Object Detection Models for TinyML: Foundations, Comparative Analysis, Challenges, and Emerging Solutions
Abstract:
Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/christophezei/Optimizing-Object-Detection-Models-for-TinyML-A-Comprehensive-Survey.
中文: 本综述通过分析量化和剪枝等关键技术,解决了在资源受限的TinyML设备上部署目标检测模型的优化挑战,同时比较了性能指标并建立了持续更新的公共资源库。
English: This survey addresses the optimization challenges of deploying object detection models on resource-constrained TinyML devices by analyzing techniques like quantization and pruning, while comparing performance metrics and maintaining a public repository for ongoing developments.
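As a concrete illustration of one technique the survey analyzes, the snippet below applies post-training dynamic quantization in PyTorch to a toy detection head; the model and layer sizes are placeholders rather than any of the surveyed implementations.

```python
# Generic illustration of one optimization technique covered by the survey:
# post-training dynamic quantization. The "detection head" here is a toy model.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 4 + 20),            # 4 box coordinates + 20 class scores
)

# Quantize Linear layers to int8 weights; activations are quantized on the fly.
q_head = torch.quantization.quantize_dynamic(head, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(head(x).shape, q_head(x).shape)  # same interface, smaller weights
```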

Authors:Seonyoung Kim, Dongil Kim
Title: MoSSDA: A Semi-Supervised Domain Adaptation Framework for Multivariate Time-Series Classification using Momentum Encoder
Abstract:
Deep learning has emerged as the most promising approach in various fields; however, when the distributions of training and test data are different (domain shift), the performance of deep learning models can degrade. Semi-supervised domain adaptation (SSDA) is a major approach for addressing this issue, assuming that a fully labeled training set (source domain) is available, but the test set (target domain) provides labels only for a small subset. In this study, we propose MoSSDA, a novel two-step SSDA framework that utilizes a momentum encoder for multivariate time-series classification. Time series data are highly sensitive to noise, and their sequential dependencies make domain shifts particularly damaging to performance. To obtain a robust, domain-invariant and class-discriminative representation, MoSSDA employs a domain-invariant encoder to learn features from both source and target domains. Subsequently, the learned features are fed to a mixup-enhanced positive contrastive module consisting of an online momentum encoder. The final classifier is trained with learned features that exhibit consistency and discriminability with limited labeled target domain data, without data augmentation. We applied a two-stage process by separating the gradient flow between the encoders and the classifier to obtain rich and complex representations. Through extensive experiments on six diverse datasets, MoSSDA achieved state-of-the-art performance for three different backbones and various unlabeled ratios in the target domain data. The ablation study confirms that each module, including two-stage learning, is effective in improving performance. Our code is available at https://github.com/seonyoungKimm/MoSSDA.
中文: 提出的MoSSDA框架通过两步动量编码器,利用对比学习和两阶段训练获取领域不变特征,有效解决了多元时间序列分类中的领域偏移问题,并在多个数据集上实现了最优性能。
English: The proposed MoSSDA framework addresses domain shift in multivariate time-series classification by employing a two-step momentum encoder to learn domain-invariant features through contrastive learning and two-stage training, achieving state-of-the-art performance across diverse datasets.
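A momentum encoder of the kind described above can be sketched as follows; this is an illustrative PyTorch fragment under assumed shapes, not the MoSSDA release. The online encoder is trained by gradients, while its momentum copy is updated as an exponential moving average and supplies stable targets for the positive contrast.

```python
# Sketch of an online encoder with an EMA momentum copy and a positive contrast.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv1d(9, 32, 5, padding=2), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 64))
momentum_encoder = copy.deepcopy(encoder)
for p in momentum_encoder.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def momentum_update(m: float = 0.99):
    # exponential moving average of the online encoder's weights
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.mul_(m).add_(p, alpha=1.0 - m)

x = torch.randn(8, 9, 128)                  # multivariate time series (B, channels, T)
x_aug = x + 0.01 * torch.randn_like(x)      # weak augmentation / mixup stand-in
z = F.normalize(encoder(x), dim=-1)
z_m = F.normalize(momentum_encoder(x_aug), dim=-1)
logits = z @ z_m.t() / 0.07                 # InfoNCE-style positive contrast
loss = F.cross_entropy(logits, torch.arange(8))
loss.backward()
momentum_update()
```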

Authors:Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding
Title: ReferSplat: Referring Segmentation in 3D Gaussian Splatting
Abstract:
We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.
Chinese: 我们提出了R3DGS这一基于自然语言描述分割3D对象的新任务,并开发了ReferSplat空间感知框架,在该任务和3D开放词汇分割基准上实现了最先进的性能。
English: We introduce R3DGS, a novel task for segmenting 3D objects using natural language descriptions, and propose ReferSplat, a spatially-aware framework that achieves state-of-the-art performance on this task and 3D open-vocabulary segmentation benchmarks.

Authors:Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, Chunhua Shen
Title: ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
Abstract:
Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied. In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system's generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/
中文摘要:ODYSSEY框架通过分层规划与全身控制的结合,使四足机器人能够在非结构化环境中稳健地执行语言引导的复杂移动操作任务。
English Summary: The ODYSSEY framework integrates hierarchical planning with whole-body control to enable legged robots to perform complex language-guided mobile manipulation tasks robustly in unstructured environments.

Authors:Wenyi Mo, Ying Ba, Tianyu Zhang, Yalong Bai, Biye Li
Title: Learning User Preferences for Image Generation Model
Abstract:
User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is https://learn-user-pref.github.io/.
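The contrastive preference loss can be illustrated with a short sketch. The function below is a hypothetical formulation (names, dimensions, and the exact competition between likes and dislikes are assumptions): the user embedding is pulled toward liked images and pushed away from disliked ones.

```python
# Illustrative contrastive preference loss: likes are positives, dislikes are negatives.
import torch
import torch.nn.functional as F

def contrastive_preference_loss(user_emb, liked_emb, disliked_emb, tau: float = 0.1):
    # user_emb: (D,), liked_emb: (P, D), disliked_emb: (N, D)
    u = F.normalize(user_emb, dim=-1)
    pos = F.normalize(liked_emb, dim=-1) @ u / tau      # (P,) similarity to likes
    neg = F.normalize(disliked_emb, dim=-1) @ u / tau   # (N,) similarity to dislikes
    # each liked item competes against all disliked items
    logits = torch.cat([pos.unsqueeze(1), neg.expand(pos.shape[0], -1)], dim=1)
    return F.cross_entropy(logits, torch.zeros(pos.shape[0], dtype=torch.long))

loss = contrastive_preference_loss(torch.randn(256), torch.randn(5, 256), torch.randn(12, 256))
```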

Authors:Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
Title: Reinforcement Learning in Vision: A Survey
Abstract:
Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.
中文摘要:本综述系统整合了视觉强化学习领域的最新进展,涵盖从策略优化到多模态模型的四大研究支柱,并分析了评估体系与现存挑战,为研究者提供领域发展图谱。
English Summary: This survey synthesizes recent advances in visual reinforcement learning, covering policy evolution, thematic pillars like multimodal models and vision-language-action systems, while addressing evaluation protocols and open challenges.
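As a worked example of the policy-optimization shift the survey traces, the snippet below computes group-relative advantages in the GRPO style: rewards of responses sampled for the same prompt are normalized within their group, so no value critic is needed. This is a simplified illustration, not any specific implementation from the surveyed works.

```python
# Group-relative advantages: normalize rewards within each prompt's sample group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar rewards per sampled response
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.1, 0.9, 0.4, 0.4],
                        [1.0, 0.0, 0.5, 0.5]])
adv = group_relative_advantages(rewards)
# each advantage then weights the token log-probabilities of its response
# inside a PPO-style clipped objective
print(adv)
```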

Authors:Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak
Title: KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning
Abstract:
Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.
中文: KARMA是一种高效的语义分割框架,相比现有方法减少了97%的参数,在保持高精度的同时实现了实时基础设施缺陷检测。
English: KARMA is a highly efficient semantic segmentation framework that achieves competitive accuracy with 97% fewer parameters than state-of-the-art methods, enabling real-time infrastructure defect inspection.

Authors:Hongkun Jin, Hongcheng Jiang, Zejun Zhang, Yuan Zhang, Jia Fu, Tingfeng Li, Kai Luo
Title: THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening
Abstract:
Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components--such as material edges and texture transitions--and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.
中文: 提出的高频增强Transformer(THAT)通过关键令牌选择和多层方差感知机制,解决了视觉Transformer在光谱图像融合中高频特征保持不足的问题,实现了最先进的融合性能。
English: The proposed Token-wise High-frequency Augmentation Transformer (THAT) overcomes limitations of Vision Transformers in hyperspectral pansharpening by introducing token selection and multi-level variance mechanisms to enhance high-frequency feature representation, achieving state-of-the-art performance.
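The token-selection idea behind PTSA can be sketched as a top-k attention variant; the module below is a hypothetical simplification (scoring head, keep ratio, and attention wiring are assumptions), not the released THAT code.

```python
# Sketch of selective attention: keep only top-scoring "pivotal" tokens as keys/values.
import torch
import torch.nn as nn

class TopKTokenAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # token informativeness score
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x):                # x: (B, N, dim)
        k = max(1, int(x.shape[1] * self.keep_ratio))
        scores = self.score(x).squeeze(-1)                 # (B, N)
        idx = scores.topk(k, dim=1).indices                # pivotal token indices
        kv = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        out, _ = self.attn(x, kv, kv)                      # all queries attend to pivotal tokens
        return out

y = TopKTokenAttention(64)(torch.randn(2, 100, 64))
print(y.shape)  # torch.Size([2, 100, 64])
```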

Authors:Luca Zedda, Andrea Loddo, Cecilia Di Ruberto, Carsten Marr
Title: RedDino: A foundation model for red blood cell analysis
Abstract:
Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. The source code and pretrained models for RedDino are available at https://github.com/Snarci/RedDino, and the pretrained models can be downloaded from our Hugging Face collection at https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc
中文: RedDino是一种自监督基础模型,通过训练125万张多样化图像,在红细胞形态分析中表现出卓越的分类性能和强大的泛化能力。
English: RedDino is a self-supervised foundation model that excels in red blood cell image analysis, achieving superior classification performance and robust generalization through training on 1.25 million diverse images.

Authors:Chongke Bi, Xin Gao, Jiangkang Deng, Guan Li, Jun Han
Title: CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data
Abstract:
Large-scale scientific simulations require significant resources to generate high-resolution (HR) time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at https://github.com/Xin-Gao-private/CD-TVD.
Chinese: CD-TVD 是一种创新框架,结合对比学习和改进的扩散模型,仅需少量高分辨率数据即可实现大规模科学模拟的精确三维超分辨率,显著降低资源需求同时保留细节特征。
English: CD-TVD is a novel framework that integrates contrastive learning with an enhanced diffusion model to achieve accurate 3D super-resolution for large-scale simulations using minimal high-resolution data, significantly reducing resource demands while preserving fine details.

Authors:Yan Wang, Da-Wei Zhou, Han-Jia Ye
Title: Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning
Abstract:
Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Existing pre-trained model-based CIL methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules such as adapters. However, incorrect module selection during inference hurts performance, and task-specific modules often overlook shared general knowledge, leading to errors on distinguishing between similar classes across tasks. To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. Specifically, we train task-specific adapters to capture the most crucial features relevant to their respective tasks and introduce an entropy-based selection mechanism to choose the most suitable adapter. Furthermore, we leverage an adapter fusion strategy to construct a universal adapter, which encodes the most discriminative features shared across tasks. We combine task-specific and universal adapter predictions to harness both specialized and general knowledge during inference. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of our approach. Code is available at: https://github.com/LAMDA-CL/ICCV2025-TUNA
Chinese: 本文提出TUNA方法,通过结合任务特定和通用适配器及基于熵的选择机制,有效利用专业知识和共享特征来提升类增量学习性能,实现了最先进的成果。
English: This paper introduces TUNA, a method that integrates task-specific and universal adapters with an entropy-based selection mechanism to enhance class-incremental learning by leveraging both specialized and shared knowledge, achieving state-of-the-art results.
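The entropy-based adapter selection can be sketched directly. The snippet below is an illustrative version under assumed shapes: each task-specific adapter produces logits, the adapter with the lowest prediction entropy is chosen, and its prediction is fused with the universal adapter's. The equal-weight fusion is an assumption rather than the exact TUNA rule.

```python
# Sketch: pick the most confident task adapter, then fuse with the universal adapter.
import torch
import torch.nn.functional as F

def entropy(p: torch.Tensor) -> torch.Tensor:
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def predict(task_logits, universal_logits):
    probs = [F.softmax(l, dim=-1) for l in task_logits]        # one per task adapter
    ents = torch.stack([entropy(p) for p in probs], dim=0)     # (num_adapters, B)
    best = ents.argmin(dim=0)                                  # (B,) lowest-entropy adapter
    chosen = torch.stack(probs, dim=0)[best, torch.arange(best.shape[0])]
    return 0.5 * (chosen + F.softmax(universal_logits, dim=-1))  # fuse with universal adapter

task_logits = [torch.randn(4, 10) for _ in range(3)]
final = predict(task_logits, torch.randn(4, 10))
```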

Authors:Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Title: MDD-Net: Multimodal Depression Detection through Mutual Transformer
Abstract:
Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The simple nature of data collection from social media platforms has attracted significant interest in properly utilizing this information for mental health research. A Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, is proposed in this work where mutual transformers are exploited to efficiently extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. The extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% for F1-Score, demonstrating the greater performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.
中文: 本研究提出MDD-Net多模态网络,利用社交媒体中的声学和视觉数据,通过互变换器进行抑郁检测,其F1分数比现有方法提高达17.37%。
English: This study introduces MDD-Net, a multimodal network that uses acoustic and visual data from social media with mutual transformers to detect depression, achieving a 17.37% higher F1-Score than existing methods.
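The mutual-transformer fusion can be approximated with two cross-attention calls, as in the sketch below; the module is an assumed simplification (dimensions, pooling, and the two-class head are placeholders), not the released MDD-Net.

```python
# Sketch of mutual cross-attention: audio attends to video, video attends to audio.
import torch
import torch.nn as nn

class MutualFusion(nn.Module):
    def __init__(self, dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 2)        # depressed / not depressed

    def forward(self, acoustic, visual):               # (B, Ta, dim), (B, Tv, dim)
        a_fused, _ = self.a2v(acoustic, visual, visual)    # acoustic queries visual
        v_fused, _ = self.v2a(visual, acoustic, acoustic)  # visual queries acoustic
        pooled = torch.cat([a_fused.mean(dim=1), v_fused.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

logits = MutualFusion()(torch.randn(2, 50, 128), torch.randn(2, 30, 128))
print(logits.shape)  # torch.Size([2, 2])
```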

Authors:Zizheng Guo, Bochao Zou, Junbao Zhuo, Huimin Ma
Title: ME-TST+: Micro-expression Analysis via Temporal State Transition with ROI Relationship Awareness
Abstract:
Micro-expressions (MEs) are regarded as important indicators of an individual's intrinsic emotions, preferences, and tendencies. ME analysis requires spotting of ME intervals within long video sequences and recognition of their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two state space model-based architectures, namely ME-TST and ME-TST+, which utilize temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The codes are available at https://github.com/zizheng-guo/ME-TST.
中文摘要:本文提出ME-TST和ME-TST+架构,利用时序状态转换机制替代传统窗口级分类,实现更精确的微表情动态表征,并通过多粒度建模和慢快Mamba框架结合特征与结果层面的协同策略,显著提升了微表情分析的性能。
English Summary: This paper introduces ME-TST and ME-TST+ architectures that use temporal state transition mechanisms to replace traditional window-level classification with video-level regression, enabling more precise micro-expression analysis while integrating spotting and recognition tasks through synergy strategies for state-of-the-art performance.

Authors:Ziad Al-Haj Hemidi, Eytan Kats, Mattias P. Heinrich
Title: PrIINeR: Towards Prior-Informed Implicit Neural Representations for Accelerated MRI
Abstract:
Accelerating Magnetic Resonance Imaging (MRI) reduces scan time but often degrades image quality. While Implicit Neural Representations (INRs) show promise for MRI reconstruction, they struggle at high acceleration factors due to weak prior constraints, leading to structural loss and aliasing artefacts. To address this, we propose PrIINeR, an INR-based MRI reconstruction method that integrates prior knowledge from pre-trained deep learning models into the INR framework. By combining population-level knowledge with instance-based optimization and enforcing dual data consistency, PrIINeR aligns both with the acquired k-space data and the prior-informed reconstruction. Evaluated on the NYU fastMRI dataset, our method not only outperforms state-of-the-art INR-based approaches but also improves upon several learning-based state-of-the-art methods, significantly improving structural preservation and fidelity while effectively removing aliasing artefacts. PrIINeR bridges deep learning and INR-based techniques, offering a more reliable solution for high-quality, accelerated MRI reconstruction. The code is publicly available on https://github.com/multimodallearning/PrIINeR.
中文: PrIINeR通过将深度学习先验知识融入隐式神经表示,显著提升了高倍加速下的MRI重建质量,有效消除混叠伪影并增强结构保真度。
English: PrIINeR enhances MRI reconstruction by integrating deep learning priors into implicit neural representations, effectively reducing aliasing artifacts and improving image quality at high acceleration factors.
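The dual data-consistency objective can be sketched as a two-term loss; the snippet below is an illustrative formulation (symbols, weighting, and the single-coil FFT forward model are assumptions): the reconstruction must match the acquired k-space samples and stay close to a prior reconstruction from a pre-trained network.

```python
# Sketch of dual data consistency: agree with measured k-space and with a prior image.
import torch
import torch.nn.functional as F

def dual_consistency_loss(recon, kspace_meas, mask, prior_recon, w_prior: float = 0.1):
    # recon, prior_recon: (H, W) complex images; kspace_meas: (H, W) complex;
    # mask: (H, W) binary sampling mask of acquired k-space locations
    kspace_pred = torch.fft.fft2(recon)
    data_term = (mask * (kspace_pred - kspace_meas)).abs().pow(2).mean()
    prior_term = F.mse_loss(torch.view_as_real(recon), torch.view_as_real(prior_recon))
    return data_term + w_prior * prior_term

recon = torch.randn(64, 64, dtype=torch.complex64, requires_grad=True)
meas = torch.fft.fft2(torch.randn(64, 64, dtype=torch.complex64))
mask = (torch.rand(64, 64) < 0.25).float()
loss = dual_consistency_loss(recon, meas, mask, torch.randn(64, 64, dtype=torch.complex64))
loss.backward()
```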

Authors:Peng Dai, Feitong Tan, Qiangeng Xu, Yihua Huang, David Futschik, Ruofei Du, Sean Fanello, Yinda Zhang, Xiaojuan Qi
Title: S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix
Abstract:
While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method has a significant improvement over previous methods. Project page at: https://daipengwa.github.io/S-2VG_ProjectPage/

Authors:Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille
Title: TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation
Abstract:
Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE
中文: 本文提出TRIDE雷达-相机融合算法,通过结合文本特征和天气自适应融合模块来增强深度估计,在自动驾驶数据集上实现了显著的性能提升。
English: This paper introduces TRIDE, a radar-camera fusion algorithm that enhances depth estimation by incorporating text features and a weather-aware fusion block, achieving significant performance improvements on autonomous driving datasets.
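The weather-aware fusion block can be illustrated with a small gating network; the sketch below is an assumed simplification (a scalar radar weight predicted from a weather embedding), not the TRIDE implementation.

```python
# Sketch of weather-aware fusion: a learned gate reweights radar vs. camera features.
import torch
import torch.nn as nn

class WeatherAwareFusion(nn.Module):
    def __init__(self, feat_dim: int = 128, weather_dim: int = 32):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(weather_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, cam_feat, radar_feat, weather_emb):
        w = self.gate(weather_emb).view(-1, 1, 1, 1)    # (B, 1, 1, 1) radar weight in [0, 1]
        return (1 - w) * cam_feat + w * radar_feat      # radar counts more in adverse weather

fused = WeatherAwareFusion()(torch.randn(2, 128, 32, 32),
                             torch.randn(2, 128, 32, 32),
                             torch.randn(2, 32))
print(fused.shape)  # torch.Size([2, 128, 32, 32])
```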

Authors:Anqi Xiao, Weichen Yu, Hongyuan Yu
Title: Sample-aware RandAugment: Search-free Automatic Data Augmentation for Effective Image Recognition
Abstract:
Automatic data augmentation (AutoDA) plays an important role in enhancing the generalization of neural networks. However, mainstream AutoDA methods often encounter two challenges: either the search process is excessively time-consuming, hindering practical application, or the performance is suboptimal due to insufficient policy adaptation during training. To address these issues, we propose Sample-aware RandAugment (SRA), an asymmetric, search-free AutoDA method that dynamically adjusts augmentation policies while maintaining straightforward implementation. SRA incorporates a heuristic scoring module that evaluates the complexity of the original training data, enabling the application of tailored augmentations for each sample. Additionally, an asymmetric augmentation strategy is employed to maximize the potential of this scoring module. In multiple experimental settings, SRA narrows the performance gap between search-based and search-free AutoDA methods, achieving a state-of-the-art Top-1 accuracy of 78.31% on ImageNet with ResNet-50. Notably, SRA demonstrates good compatibility with existing augmentation pipelines and solid generalization across new tasks, without requiring hyperparameter tuning. The pretrained models leveraging SRA also enhance recognition in downstream object detection tasks. SRA represents a promising step towards simpler, more effective, and practical AutoDA designs applicable to a variety of future tasks. Our code is available at https://github.com/ainieli/Sample-awareRandAugment.
Chinese: 样本感知随机增强(SRA)是一种无需搜索的自动数据增强方法,它根据样本复杂度动态调整增强策略,在ImageNet上取得了最优性能,并在多种任务中展现出良好的泛化能力。
English: Sample-aware RandAugment (SRA) is a search-free automatic data augmentation method that dynamically adjusts policies based on sample complexity, achieving state-of-the-art performance on ImageNet and demonstrating strong generalization across tasks.
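The sample-aware idea can be sketched with a cheap difficulty score mapped to an augmentation magnitude. In the snippet below both the score (prediction entropy) and the easy-gets-stronger mapping are assumptions standing in for SRA's heuristic scoring module and asymmetric policy.

```python
# Sketch: score a sample's difficulty, then choose a per-sample augmentation magnitude.
import torch
import torchvision.transforms as T

def augmentation_magnitude(logits: torch.Tensor, m_min: float = 0.1, m_max: float = 0.9) -> float:
    probs = logits.softmax(dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    max_ent = torch.log(torch.tensor(float(logits.numel()))).item()
    score = ent / max_ent              # 0 = easy/confident, 1 = hard/uncertain
    # easy samples get stronger augmentation, hard samples get weaker (asymmetric policy)
    return m_max - score * (m_max - m_min)

def build_transform(magnitude: float) -> T.Compose:
    return T.Compose([
        T.RandomResizedCrop(224, scale=(1.0 - 0.5 * magnitude, 1.0)),
        T.ColorJitter(brightness=magnitude, contrast=magnitude),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])

m = augmentation_magnitude(torch.randn(1000))   # logits from the current model (placeholder)
transform = build_transform(m)
```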

Authors:Ajnas Muhammed, Iurri Medvedev, Nuno Gonçalves
Title: VOIDFace: A Privacy-Preserving Multi-Network Face Recognition With Enhanced Security
Abstract:
Advancement of machine learning techniques, combined with the availability of large-scale datasets, has significantly improved the accuracy and efficiency of facial recognition. Modern facial recognition systems are trained using large face datasets collected from diverse individuals or public repositories. However, for training, these datasets are often replicated and stored across multiple workstations, which complicates database management and oversight. Currently, once a user submits their face for dataset preparation, they lose control over how their data is used, raising significant privacy and ethical concerns. This paper introduces VOIDFace, a novel framework for facial recognition systems that addresses two major issues. First, it eliminates the need for data replication and improves data control by securely storing training face data using visual secret sharing. Second, it proposes a patch-based multi-training network that uses this novel training data storage mechanism to develop a robust, privacy-preserving facial recognition system. By integrating these advancements, VOIDFace aims to improve the privacy, security, and efficiency of facial recognition training, while ensuring greater control over sensitive personal face data. VOIDFace also enables users to exercise their right to be forgotten and retain control over their personal data. Experimental evaluations on the VGGFace2 dataset show that VOIDFace provides the right to be forgotten, improved data control, security, and privacy while maintaining competitive facial recognition performance. Code is available at: https://github.com/ajnasmuhammed89/VOIDFace
中文: VOIDFace提出了一种基于视觉秘密共享的人脸识别框架,无需数据复制即可增强用户对个人数据的控制权,在保护隐私的同时保持了优异的识别性能。
English: VOIDFace introduces a privacy-preserving facial recognition framework using visual secret sharing to eliminate data replication and enhance user control over personal data, while maintaining competitive performance.

Authors:Jin-Seop Lee, SungJoon Lee, Jaehan Ahn, YunSeok Choi, Jee-Hyong Lee
Title: TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding
Abstract:
Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. Also, they rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they heavily depend on the use of LLMs, which require expensive inference. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our proposed method effectively captures the temporal context of videos and addresses distorted similarity distributions without training. Our approach achieves state-of-the-art results on the Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs. Our code is available at https://github.com/Nuetee/TAG.
中文: 提出的TAG方法通过时序池化、时序一致性聚类和相似度调整,解决了零样本视频时序定位中的语义碎片化和相似度分布偏差问题,在不依赖大语言模型或额外训练的情况下实现了最优性能。
English: The proposed TAG method addresses semantic fragmentation and skewed similarity distributions in zero-shot video temporal grounding by incorporating temporal pooling, coherence clustering, and similarity adjustment, achieving state-of-the-art performance without LLMs or additional training.
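Temporal coherence clustering can be sketched over per-frame query similarities: consecutive frames are merged into one segment while their similarity stays close, and the segment with the highest mean similarity is returned. The snippet below is an illustrative simplification, not the released TAG code.

```python
# Sketch: merge temporally coherent frames, then pick the best-matching segment.
import torch

def coherent_segments(frame_sims: torch.Tensor, tol: float = 0.05):
    # frame_sims: (T,) per-frame similarity between frame features and the text query
    segments, start = [], 0
    for t in range(1, len(frame_sims)):
        if abs(frame_sims[t] - frame_sims[t - 1]) > tol:   # coherence break
            segments.append((start, t - 1))
            start = t
    segments.append((start, len(frame_sims) - 1))
    # the segment with the highest mean similarity becomes the grounded moment
    best = max(segments, key=lambda s: frame_sims[s[0]: s[1] + 1].mean().item())
    return segments, best

sims = torch.tensor([0.21, 0.22, 0.20, 0.55, 0.58, 0.57, 0.23, 0.22])
segments, best = coherent_segments(sims)
print(best)  # expected: the (3, 5) high-similarity segment
```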

Authors:Yongtao Ge, Kangyang Xie, Guangkai Xu, Mingyu Liu, Li Ke, Longtao Huang, Hui Xue, Hao Chen, Chunhua Shen
Title: Generative Video Matting
Abstract:
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
Chinese: 本研究通过引入可扩展的合成数据生成流程和利用预训练视频扩散模型的新视频抠图方法,解决了视频抠图的局限性,在真实场景中实现了卓越性能和强大的泛化能力。
English: This study addresses the limitations in video matting by introducing a scalable synthetic data generation pipeline and a novel video matting approach that leverages pre-trained video diffusion models, achieving superior performance and strong generalization in real-world scenarios.

Authors:Marco Peer, Anna Scius-Bertrand, Andreas Fischer
Title: CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality
Abstract:
Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
中文: 本研究提出了一种基于CTC对齐的自训练方法,用于纠正历史文献中的标注错误,提升了识别性能和匹配精度,并发布了手动校正的数据集和代码。
English: This study introduces a self-training method using CTC alignment to correct annotation errors in historical documents, improving recognition performance and alignment accuracy while releasing a manually corrected dataset and code.

Authors:Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu
Title: Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Abstract:
Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5's superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.
中文: 本文提出首个实时可控的视觉-语言-动作模型Being-M0.5,通过创新的部位感知量化技术和海量HuMo100M数据集解决了动作控制的五大关键瓶颈,在多项基准测试中实现了最先进的性能表现。
English: This paper introduces Being-M0.5, the first real-time controllable vision-language-motion model that overcomes key limitations in motion controllability through a novel part-aware quantization technique and the comprehensive HuMo100M dataset, achieving state-of-the-art performance across multiple benchmarks.
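The residual quantization that underlies the part-aware motion tokenizer can be sketched in a few lines; the module below is a generic residual vector quantizer under assumed sizes (one such quantizer per body part is implied but not shown), not the Being-M0.5 tokenizer.

```python
# Sketch of residual vector quantization: each stage quantizes what remains.
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    def __init__(self, dim: int = 64, codebook_size: int = 512, num_stages: int = 3):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(num_stages)])

    def forward(self, x):                           # x: (B, dim) motion feature
        residual, codes, recon = x, [], torch.zeros_like(x)
        for cb in self.codebooks:
            dists = torch.cdist(residual, cb)       # (B, codebook_size)
            idx = dists.argmin(dim=-1)              # nearest code per sample
            quantized = cb[idx]
            codes.append(idx)
            recon = recon + quantized
            residual = residual - quantized         # pass the remainder to the next stage
        return recon, torch.stack(codes, dim=-1)    # recon: (B, dim), codes: (B, num_stages)

recon, codes = ResidualQuantizer()(torch.randn(4, 64))
print(codes.shape)  # torch.Size([4, 3])
```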

Authors:Shunya Nagashima, Komei Sugiura
Title: Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images
Abstract:
Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, while predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods and even human expert performance on standard metrics in terms of performance and reliability. The project page can be found at https://keio-smilab25.github.io/DeepSWM.

Authors:Jingna Qiu, Nishanth Jain, Jonas Ammeling, Marc Aubreville, Katharina Breininger
Title: Effortless Vision-Language Model Specialization in Histopathology without Annotation
Abstract:
Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at https://github.com/DeepMicroscopy/Annotation-free-VLM-specialization.
Chinese: 本文提出了一种无需标注的组织病理学视觉语言模型自适应方法,通过对领域相关图像-文本对进行持续预训练,无需人工标注即可显著提升零样本和小样本任务的性能。
English: This paper introduces an annotation-free adaptation method for Vision-Language Models (VLMs) in histopathology, using continued pretraining on domain-specific image-caption pairs to enhance zero-shot and few-shot performance without manual labeling.
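Continued pretraining on mined image-caption pairs typically uses a symmetric contrastive objective; the sketch below shows a CLIP-style loss on placeholder embeddings standing in for a histopathology VLM's image and text encoders. It illustrates the training signal, not the paper's exact recipe.

```python
# Sketch of a symmetric image-text contrastive loss for continued pretraining.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                    # (B, B) pairwise similarities
    targets = torch.arange(img.shape[0])            # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# one continued-pretraining step on a batch of domain-relevant image-caption pairs
img_emb = torch.randn(16, 512, requires_grad=True)  # image-encoder outputs (placeholder)
txt_emb = torch.randn(16, 512, requires_grad=True)  # caption-encoder outputs (placeholder)
loss = clip_style_loss(img_emb, txt_emb)
loss.backward()
```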

Authors:Animesh Jain, Alexandros Stergiou
Title: MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
Abstract:
Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics as well as semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

Authors:Xiaoqi Zhao, Peiqian Cao, Lihe Zhang, Zonglei Feng, Hanqi Liu, Jiaming Zuo, Youwei Pang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu
Title: Power Battery Detection
Abstract:
Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD.
中文摘要:本研究针对动力电池X射线图像中的极片检测难题,首次提出大规模基准数据集PBD5K和MDCNeXt模型,通过融合多维结构线索有效解决密集排布、低对比度等工业检测痛点。
English Summary: This study introduces PBD5K, the first large-scale benchmark for power battery detection using X-ray images, and proposes MDCNeXt, a novel model that integrates multi-dimensional structural clues to accurately localize electrode plates despite visual challenges.

Authors:Hongrui Zheng, Yuezun Li, Liejun Wang, Yunfeng Diao, Zhiqing Guo
Title: Boosting Active Defense Persistence: A Two-Stage Defense Framework Combining Interruption and Poisoning Against Deepfake
Abstract:
Active defense strategies have been developed to counter the threat of deepfake technology. However, a primary challenge is their lack of persistence, as their effectiveness is often short-lived. Attackers can bypass these defenses by simply collecting protected samples and retraining their models. This means that static defenses inevitably fail when attackers retrain their models, which severely limits practical use. We argue that an effective defense not only distorts forged content but also blocks the model's ability to adapt, which occurs when attackers retrain their models on protected images. To achieve this, we propose an innovative Two-Stage Defense Framework (TSDF). Benefiting from the intensity separation mechanism designed in this paper, the framework uses dual-function adversarial perturbations to perform two roles. First, it can directly distort the forged results. Second, it acts as a poisoning vehicle that disrupts the data preparation process essential for an attacker's retraining pipeline. By poisoning the data source, TSDF aims to prevent the attacker's model from adapting to the defensive perturbations, thus ensuring the defense remains effective long-term. Comprehensive experiments show that the performance of traditional interruption methods degrades sharply when it is subjected to adversarial retraining. However, our framework shows a strong dual defense capability, which can improve the persistence of active defense. Our code will be available at https://github.com/vpsg-research/TSDF.
中文摘要:本文提出的两阶段防御框架(TSDF)通过双功能对抗扰动,既能直接干扰伪造结果,又能作为数据污染载体破坏攻击者的模型重训练过程,从而显著提升主动防御的持久性。
English Summary: The proposed Two-Stage Defense Framework (TSDF) uses dual-function adversarial perturbations to both distort deepfake outputs and poison attackers' training data, ensuring long-term defense persistence against model retraining.
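The interruption half of the defense can be sketched as a bounded PGD loop that maximizes the distortion of the forged output; the code below is an illustrative stand-in (toy generator, assumed loss), and TSDF's poisoning role, which relies on the paper's intensity-separation mechanism, is not reproduced here.

```python
# Sketch of an interruption perturbation: PGD that maximizes the forgery's distortion.
import torch
import torch.nn as nn

def interruption_perturbation(image, deepfake_model, steps=10, eps=8/255, alpha=2/255):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        forged = deepfake_model(image + delta)
        loss = ((forged - image) ** 2).mean()      # push the forged output away from the source
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()     # gradient ascent on distortion
            delta.clamp_(-eps, eps)                # keep the protection imperceptible
        delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)

toy_generator = nn.Conv2d(3, 3, 3, padding=1)      # stands in for a deepfake generator
protected = interruption_perturbation(torch.rand(1, 3, 64, 64), toy_generator)
```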

Authors:Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal
Title: Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)
Abstract:
Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at https://github.com/bastianlb/forecasting-rotational-dynamics
中文摘要:本研究提出一种利用神经控制微分方程和SO(3) Savitzky-Golay路径的稳健方法,能够在无需能量守恒假设的情况下有效处理非保守力和噪声,实现对三维旋转轨迹的精确建模。
English Summary: This study introduces a robust method using Neural Controlled Differential Equations and SO(3) Savitzky-Golay paths to model 3D rotation trajectories, effectively handling non-conservative forces and noise without relying on energy conservation assumptions.

Authors:Xiaoyan Liu, Kangrui Li, Jiaxin Liu
Title: Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation
Abstract:
The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) over existing methods.

Authors:Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao
Title: UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
Abstract:
Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design, represented as SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, the rapid growth of Multi-modal Large Language Models (MLLMs) has demonstrated the capability to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details at https://ryanlijinke.github.io/.

Authors:Junhyuk So, Juncheol Shin, Hyunho Kook, Eunhyeok Park
Title: Grouped Speculative Decoding for Autoregressive Image Generation
Abstract:
Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality-all without requiring any additional training. The source code is available at https://github.com/junhyukso/GSD
English Summary: Grouped Speculative Decoding (GSD) is a novel training-free acceleration method that addresses the slow inference of autoregressive image models by dynamically evaluating clusters of visually valid tokens, achieving an average 3.7x speedup while maintaining image quality.
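As a rough illustration of the grouped-acceptance idea (not the paper's exact criterion), the sketch below accepts a drafted image token whenever a cluster of visually similar codebook entries carries enough target probability mass. The static embedding-distance grouping and the threshold `tau` are assumptions for simplicity; GSD itself forms clusters dynamically, as the abstract notes.

```python
# Hedged sketch of grouped speculative acceptance for AR image tokens.
import numpy as np

def grouped_accept(draft_tok, p_draft, p_target, codebook, rng, tau=0.15):
    """Accept a drafted image token if the *group* of visually similar
    codebook entries carries enough target probability mass."""
    # Form a cluster of tokens whose codebook embeddings lie close to the
    # drafted token's embedding (static thresholding; GSD itself is dynamic).
    d = np.linalg.norm(codebook - codebook[draft_tok], axis=1)
    group = np.where(d <= tau)[0]
    # Classic SD accepts with prob min(1, p_t(x)/p_d(x)) for a single token;
    # here the ratio is evaluated on the whole group's probability mass.
    ratio = p_target[group].sum() / max(p_draft[group].sum(), 1e-12)
    if rng.random() < min(1.0, ratio):
        # Emit a token sampled from the target distribution restricted to the
        # accepted group, so the output stays visually valid.
        probs = p_target[group] / p_target[group].sum()
        return int(rng.choice(group, p=probs))
    return None  # rejection: fall back to a standard target-model sample
```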

Authors:Bo Jia, Yanan Guo, Ying Chang, Benkui Zhang, Ying Xie, Kangning Du, Lin Cao
Title: Multi-view Normal and Distance Guidance Gaussian Splatting for Surface Reconstruction
Abstract:
3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS. Our code will be made publicly available at (https://github.com/Bistu3DV/MND-GS/).
中文摘要:本文提出了一种多视角法向量和距离引导的高斯溅射方法,通过约束相邻深度图和对齐三维法向量,有效解决了多视角场景中的几何偏差问题,显著提升了3DGS的表面重建能力。
English Summary: This paper introduces a multi-view normal and distance-guided Gaussian splatting method that enhances 3DGS surface reconstruction by addressing geometric inconsistencies through depth unification and normal alignment across views.
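A minimal sketch of the two cross-view constraints described above, written for pre-matched pixel pairs between nearby views (the reprojection and matching step itself is omitted). Loss weights and conventions are assumptions, not the released code.

```python
# Hedged sketch of the multi-view distance and normal consistency terms.
import torch.nn.functional as F

def distance_loss(dist_view_a, dist_view_b):
    """Distances from two nearby cameras to the same Gaussian surface,
    expressed for corresponding pixels, should agree."""
    return (dist_view_a - dist_view_b).abs().mean()

def normal_loss(normals_a, normals_b):
    """Matched pixels in nearby views should carry consistent 3D normals."""
    return (1.0 - F.cosine_similarity(normals_a, normals_b, dim=-1)).mean()
```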

Authors:Yimin Fu, Zhunga Liu, Dongxiu Guo, Longfei Wang
Title: Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels
Abstract:
The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, the semi-supervised learning of two divergent branches is conducted based on the data divided by each other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.
中文摘要:本研究提出的CLSDF方法通过多模型特征融合和半监督学习,将散射特征与深度特征相结合,有效解决了合成孔径雷达自动目标识别中的噪声标签问题,在MSTAR数据集上取得了最优性能。
English Summary: The proposed CLSDF method integrates scattering and deep features through multi-model fusion and semi-supervised learning to effectively address noisy label challenges in SAR automatic target recognition, achieving state-of-the-art performance on the MSTAR dataset.
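The class-wise loss-distribution modeling can be sketched with scikit-learn as follows: for each class a two-component Gaussian Mixture is fit to the per-sample losses, and samples assigned to the low-mean component are treated as clean. The 0.5 posterior threshold is an assumption for illustration.

```python
# Minimal sketch of the GMM-based clean/noisy split over per-sample losses.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses, labels, num_classes, thresh=0.5):
    clean_mask = np.zeros_like(losses, dtype=bool)
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            clean_mask[idx] = True
            continue
        feats = losses[idx].reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, reg_covar=1e-4).fit(feats)
        probs = gmm.predict_proba(feats)
        clean_comp = int(np.argmin(gmm.means_.ravel()))   # low-loss component
        clean_mask[idx] = probs[:, clean_comp] > thresh
    return clean_mask
```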

Authors:Xiaohang Zhan, Dingming Liu
Title: LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
Abstract:
We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to "render" the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
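The volume-rendering principle behind the method can be illustrated by front-to-back alpha compositing of per-object latents with user-controlled opacities. The layer ordering, opacity values, and latent shapes below are illustrative, not the paper's implementation.

```python
# Toy sketch of front-to-back alpha compositing with per-object opacities,
# the volume-rendering principle the method applies to latent features.
import torch

def composite_latents(object_latents, opacities):
    """object_latents: list of (C, H, W) latents ordered front to back;
    opacities: per-object alpha in [0, 1] controlling occlusion strength."""
    out = torch.zeros_like(object_latents[0])
    transmittance = torch.ones(object_latents[0].shape[-2:])   # (H, W)
    for latent, alpha in zip(object_latents, opacities):
        out = out + transmittance * alpha * latent
        transmittance = transmittance * (1.0 - alpha)           # light left over
    return out
```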

Authors:Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu
Title: X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning
Abstract:
Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize the industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8\% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.
Chinese: 本文提出了X2Edit数据集,包含370万高质量图像编辑样本覆盖14个任务,并设计了基于FLUX.1的任务感知MoE-LoRA模型,仅用8%参数量即实现卓越的编辑性能。
English: This paper introduces the X2Edit dataset, a comprehensive collection of 3.7 million high-quality image editing examples across 14 tasks, and presents a task-aware MoE-LoRA model that achieves competitive editing performance with only 8% of full model parameters.
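The contrastive term over diffusion-internal representations can be sketched as a standard InfoNCE loss in which samples sharing an editing type are positives. This is a generic formulation under assumed pooled features, not necessarily X2Edit's exact loss.

```python
# Hedged sketch of a task-aware contrastive loss: features from samples of the
# same editing type are pulled together, others pushed apart (InfoNCE form).
import torch
import torch.nn.functional as F

def edit_type_infonce(features, edit_types, temperature=0.07):
    """features: (B, D) pooled internal representations; edit_types: (B,) ids."""
    z = F.normalize(features, dim=-1)
    logits = z @ z.t() / temperature                          # (B, B)
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(eye, float("-inf"))           # drop self-pairs
    pos = (edit_types.unsqueeze(0) == edit_types.unsqueeze(1)) & ~eye
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability over the available positives for each anchor.
    denom = pos.sum(dim=1).clamp(min=1)
    return -((log_prob * pos).sum(dim=1) / denom).mean()
```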

Authors:Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
Title: LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Abstract:
In this paper, we present LaVieID, a novel \underline{l}ocal \underline{a}utoregressive \underline{vi}d\underline{e}o diffusion framework designed to tackle the challenging \underline{id}entity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
中文: LaVieID是一种新颖的局部自回归视频扩散框架,通过空间上建模细粒度面部特征和时间上利用自回归模块增强帧间一致性,有效解决了文本到视频生成中的身份保持难题。
English: LaVieID is a novel local autoregressive video diffusion framework that preserves identity in text-to-video generation by spatially modeling fine-grained facial features and temporally enhancing inter-frame consistency through autoregressive bias prediction.

Authors:Yu-Huan Wu, Wei Liu, Zi-Xuan Zhu, Zizhou Wang, Yong Liu, Liangli Zhen
Title: GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm
Abstract:
Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at https://github.com/yuhuan-wu/GAPNet.
中文摘要:GAPNet提出了一种轻量级的粒度感知网络,用于图像和视频显著性目标检测,通过多尺度监督和高效融合模块,以极低计算成本实现了最先进的性能。
English Summary: GAPNet introduces a lightweight granularity-aware network for image and video salient object detection, using multi-scale supervision and efficient fusion modules to achieve state-of-the-art performance with minimal computational cost.

Authors:Chidaksh Ravuru
Title: Commentary Generation for Soccer Highlights
Abstract:
Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at https://github.com/chidaksh/SoccerCommentary.
中文摘要:本文基于GOAL数据集扩展MatchVoice模型用于足球集锦解说生成,通过实验验证了时序对齐效果的提升,同时指出需融合更广泛的视频语言技术以进一步提高性能。
English Summary: This paper extends the MatchVoice model for generating soccer commentary on highlight clips using the GOAL dataset, demonstrating improved temporal alignment through experiments while identifying the need for incorporating broader video-language techniques to enhance performance.

Authors:Xiaoming Li, Wangmeng Zuo, Chen Change Loy
Title: Enhanced Generative Structure Prior for Chinese Text Image Super-resolution
Abstract:
Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character's style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at https://github.com/csxmli2016/MARCONetPlusPlus
中文摘要:本文提出了一种文本图像超分辨率框架,通过结合结构先验与StyleGAN模型,利用码本机制分离汉字结构与风格特征,实现对低分辨率汉字笔画结构的精准重建。
English Summary: This paper introduces a text image super-resolution framework that uses a novel structure prior integrated with StyleGAN to accurately restore degraded Chinese characters by separating structural and stylistic features.

Authors:Pranav Chougule
Title: Novel View Synthesis with Gaussian Splatting: Impact on Photogrammetry Model Accuracy and Resolution
Abstract:
In this paper, I present a comprehensive study comparing Photogrammetry and Gaussian Splatting techniques for 3D model reconstruction and view synthesis. I created a dataset of images from a real-world scene and constructed 3D models using both methods. To evaluate the performance, I compared the models using structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and lp/mm resolution based on the USAF resolution chart. A significant contribution of this work is the development of a modified Gaussian Splatting repository, which I forked and enhanced to enable rendering images from novel camera poses generated in the Blender environment. This innovation allows for the synthesis of high-quality novel views, showcasing the flexibility and potential of Gaussian Splatting. My investigation extends to an augmented dataset that includes both original ground images and novel views synthesized via Gaussian Splatting. This augmented dataset was employed to generate a new photogrammetry model, which was then compared against the original photogrammetry model created using only the original images. The results demonstrate the efficacy of using Gaussian Splatting to generate novel high-quality views and its potential to improve photogrammetry-based 3D reconstructions. The comparative analysis highlights the strengths and limitations of both approaches, providing valuable information for applications in extended reality (XR), photogrammetry, and autonomous vehicle simulations. Code is available at https://github.com/pranavc2255/gaussian-splatting-novel-view-render.git.
Chinese: 本研究比较了摄影测量与高斯泼溅技术在三维重建中的表现,通过改进的高斯泼溅方法生成高质量新视角图像,有效提升了摄影测量模型的质量,为扩展现实和自动驾驶模拟提供了实用参考。
English: This study compares Photogrammetry and Gaussian Splatting for 3D reconstruction, demonstrating that enhanced Gaussian Splatting can generate high-quality novel views to improve photogrammetric models through comprehensive performance metrics.
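A small helper of the kind used for such comparisons, computing PSNR and SSIM with scikit-image on aligned uint8 RGB renders (LPIPS additionally requires a learned network, e.g., the `lpips` package, and the USAF chart reading is not modeled here).

```python
# Illustrative evaluation helper: PSNR and SSIM between a reference photo and
# a rendered view, both assumed to be aligned uint8 RGB arrays of equal size.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare_renders(reference, rendered):
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=255)
    ssim = structural_similarity(reference, rendered, channel_axis=-1,
                                 data_range=255)
    return {"psnr": psnr, "ssim": ssim}
```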

Authors:Yuxin Zhang, Yunkang Cao, Yuqi Cheng, Yihan Sun, Weiming Shen
Title: Leveraging Learning Bias for Noisy Anomaly Detection
Abstract:
This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework's contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
中文摘要:本文提出了一种两阶段全无监督图像异常检测框架,通过利用模型内在学习偏差来过滤受污染的训练数据,在不同噪声条件下均实现了卓越的异常检测与定位性能。
English Summary: This paper introduces a two-stage framework for fully unsupervised image anomaly detection that leverages inherent model learning bias to filter contaminated training data, achieving superior detection and localization performance across various noise conditions.
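Stage 1 can be summarized by the following sketch: anomaly scores from sub-models trained on disjoint subsets are aggregated, and the highest-scoring fraction of training samples is discarded to form the purified set. The mean aggregation and 10% removal ratio are assumptions, not the paper's exact settings.

```python
# Minimal sketch of the purification step built on cross-model anomaly scores.
import numpy as np

def purify_training_set(score_matrix, drop_ratio=0.10):
    """score_matrix: (num_submodels, num_samples) cross-model anomaly scores."""
    agg = score_matrix.mean(axis=0)                  # aggregate across sub-models
    cutoff = np.quantile(agg, 1.0 - drop_ratio)      # keep the low-score majority
    keep_idx = np.where(agg < cutoff)[0]
    return keep_idx                                  # indices of the purified set
```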

Authors:Youqi Wang, Shunquan Tan, Rongxuan Peng, Bin Li, Jiwu Huang
Title: CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization
Abstract:
The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3's Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE's SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at https://github.com/SZAISEC/CLUE.
中文: 本文提出CLUE框架,通过改造Stable Diffusion 3并结合Segment Anything模型,能高效检测并定位数字图像伪造区域,其通过放大统计异常和强化边界细节实现卓越的取证性能。
English: The paper introduces CLUE, a framework that repurposes Stable Diffusion 3 and integrates it with the Segment Anything Model to efficiently detect and localize digital image forgeries by amplifying statistical inconsistencies and enhancing boundary details.
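The rectified-flow style noise injection at varying intensities can be sketched as a simple interpolation between the clean latent and Gaussian noise; the intensity schedule below is an assumed example, not the paper's values.

```python
# Minimal sketch of rectified-flow style noise injection at several
# intensities, used here to perturb the latent before LoRA-tuned denoising.
import torch

def rf_noise_levels(latent, intensities=(0.1, 0.3, 0.5)):
    """Return a list of noised latents x_t = (1 - t) * x0 + t * eps."""
    out = []
    for t in intensities:
        eps = torch.randn_like(latent)
        out.append((1.0 - t) * latent + t * eps)
    return out
```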

Authors:Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, Cairong Zhao
Title: CharacterShot: Controllable and Consistent 4D Character Animation
Abstract:
In this paper, we propose \textbf{CharacterShot}, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows for any 2D pose sequence as a controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.
中文: 本文提出CharacterShot框架,通过单张角色图像和二维姿态序列生成可控的四维角色动画,结合二维动画预训练、双注意力三维提升和优化高斯溅射技术,并在新建大规模数据集上验证了其优越性能。
English: This paper introduces CharacterShot, a controllable 4D character animation framework that generates dynamic 3D characters from a single image and 2D pose sequence through a multi-stage process involving 2D animation pretraining, 3D lifting with dual-attention modules, and optimized 4D Gaussian splatting, validated on a new large-scale dataset.

Authors:Rongxuan Peng, Shunquan Tan, Chenqi Kong, Anwei Luo, Alex C. Kot, Jiwu Huang
Title: ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack
Abstract:
Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading the IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across any input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at https://github.com/siriusPRX/ForensicsSAM.
中文摘要:针对视觉模型参数高效微调方法易受对抗攻击的问题,我们提出ForensicsSAM框架,通过注入伪造专家和对抗检测器来增强模型鲁棒性,在图像伪造检测与定位任务中同时实现卓越的抗攻击能力和最优性能。
English summary: Parameter-efficient fine-tuning (PEFT) methods for vision models are vulnerable to adversarial attacks, so we propose ForensicsSAM, a unified framework that integrates forgery experts and adversary detectors to enhance robustness while maintaining state-of-the-art performance in image forgery detection and localization.

Authors:Qilin Zhang, Olaf Wysocki, Boris Jutzi
Title: GS4Buildings: Prior-Guided Gaussian Splatting for 3D Building Reconstruction
Abstract:
Recent advances in Gaussian Splatting (GS) have demonstrated its effectiveness in photo-realistic rendering and 3D reconstruction. Among these, 2D Gaussian Splatting (2DGS) is particularly suitable for surface reconstruction due to its flattened Gaussian representation and integrated normal regularization. However, its performance often degrades in large-scale and complex urban scenes with frequent occlusions, leading to incomplete building reconstructions. We propose GS4Buildings, a novel prior-guided Gaussian Splatting method leveraging the ubiquity of semantic 3D building models for robust and scalable building surface reconstruction. Instead of relying on traditional Structure-from-Motion (SfM) pipelines, GS4Buildings initializes Gaussians directly from low-level Level of Detail (LoD)2 semantic 3D building models. Moreover, we generate prior depth and normal maps from the planar building geometry and incorporate them into the optimization process, providing strong geometric guidance for surface consistency and structural accuracy. We also introduce an optional building-focused mode that limits reconstruction to building regions, achieving a 71.8% reduction in Gaussian primitives and enabling a more efficient and compact representation. Experiments on urban datasets demonstrate that GS4Buildings improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%. These results highlight the potential of semantic building model integration to advance GS-based reconstruction toward real-world urban applications such as smart cities and digital twins. Our project is available: https://github.com/zqlin0521/GS4Buildings.
中文: GS4Buildings提出了一种基于先验引导的高斯泼溅方法,利用语义三维建筑模型将复杂城市场景的重建完整度提升20.5%,几何精度提高32.8%,并可通过专注建筑区域模式减少71.8%的高斯基元以实现高效重建。
English: GS4Buildings introduces a prior-guided Gaussian Splatting method that leverages semantic 3D building models to enhance reconstruction completeness by 20.5% and geometric accuracy by 32.8% in complex urban scenes, while optionally reducing Gaussian primitives by 71.8% for efficiency.

Authors:Tingyu Yang, Jue Gong, Jinpei Guo, Wenbo Li, Yong Guo, Yulun Zhang
Title: SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal
Abstract:
JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: https://github.com/frakenation/SODiff
Chinese: SODiff是一种新颖的语义导向一步扩散模型,通过引入语义对齐图像提示提取器和质量因子感知时间预测器,有效去除JPEG伪影,在视觉质量和量化指标上均优于现有方法。
English: SODiff is a novel semantic-oriented one-step diffusion model that effectively removes JPEG artifacts by incorporating semantic-aligned image prompts and a quality factor-aware time predictor, outperforming existing methods in visual quality and quantitative metrics.

Authors:Fangtai Wu, Mushui Liu, Weijie He, Wanggui He, Hao Jiang, Zhao Wang, Yunlong Yu
Title: CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation
Abstract:
The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose \textbf{CoAR}, a novel framework for injecting subject concepts into the unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than \textbf{0.05\%} of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: https://github.com/KZF-kzf/CoAR
中文:CoAR是一种创新框架,通过极少量参数调整将主题概念注入统一自回归模型,在实现卓越定制化和高效性的同时有效防止过拟合和语言漂移。
English: CoAR is a novel framework that injects subject concepts into unified autoregressive models with minimal parameter tuning, achieving superior customization and efficiency while preventing overfitting and language drift.

Authors:Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang
Title: MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
Abstract:
Efficient lightweight neural networks are attracting increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale high-quality video-text dataset, resulting in an efficient video-text model, termed MobileViCLIP, that can run on mobile devices with strong zero-shot classification and retrieval capabilities. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains performance similar to InternVideo2-L14 and 6.9\% better than InternVideo2-S14 on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
Chinese: 本文提出MobileViCLIP高效视频文本模型,通过时序结构重参数化技术和大规模训练,在移动设备上实现高速运行,具备强大的零样本分类与检索能力,其速度和性能均显著超越现有模型。
English: This paper introduces MobileViCLIP, an efficient video-text model that leverages temporal structural reparameterization and large-scale training to achieve high-speed performance on mobile devices with strong zero-shot capabilities, significantly outperforming existing models in both speed and retrieval accuracy.
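Structural reparameterization along the temporal axis can be illustrated with the toy module below, where a depthwise temporal convolution trained in parallel with an identity shortcut is folded into a single kernel at inference; the kernel size and layout are assumptions, not MobileViCLIP's actual blocks.

```python
# Hedged sketch of temporal structural reparameterization: train two branches,
# deploy one convolution after folding the identity shortcut into its weights.
import torch
import torch.nn as nn

class RepTemporalDW(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, k, padding=k // 2,
                              groups=channels, bias=True)
        self.deployed = False

    def forward(self, x):                       # x: (B, C, T)
        if self.deployed:
            return self.conv(x)                 # single branch after folding
        return self.conv(x) + x                 # conv branch + identity branch

    @torch.no_grad()
    def reparameterize(self):
        """Fold the identity shortcut into the convolution weights."""
        k = self.conv.kernel_size[0]
        ident = torch.zeros_like(self.conv.weight)   # (C, 1, k) depthwise kernel
        ident[:, 0, k // 2] = 1.0
        self.conv.weight += ident
        self.deployed = True
```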

Authors:Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang
Title: MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark
Abstract:
Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.
中文: 持续学习旨在让AI系统能够不断获取新知识而不遗忘已学内容,随着多模态大语言模型的出现,涉及视觉和语言等多模态的持续学习任务受到关注,为此我们开发了MCITlib代码库,用于多模态持续指令调优,目前已实现8种算法并在2个基准上进行了系统评估。
English: Continual learning enables AI to continuously learn new knowledge without forgetting past information, and with the rise of Multimodal Large Language Models, Multimodal Continual Learning has gained attention for handling tasks across multiple modalities like vision and language, leading to the development of MCITlib, a code library for continual instruction tuning that includes 8 algorithms and evaluations on 2 benchmarks.

Authors:Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang
Title: BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation
Abstract:
Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at https://github.com/maomao0819/BEVANet.
Chinese: BEVANet采用双边架构和大核注意力机制,结合自适应模块实现实时语义分割,在Cityscapes数据集上以33 FPS达到81.0% mIoU的顶尖性能。
English: BEVANet introduces a bilateral architecture with Large Kernel Attention and adaptive mechanisms to achieve real-time semantic segmentation, delivering state-of-the-art performance of 81.0% mIoU on Cityscapes at 33 FPS.

Authors:Zhiqiang Shen, Peng Cao, Xiaoli Liu, Jinzhu Yang, Osmar R. Zaiane
Title: SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations
Abstract:
Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose \textbf{SynMatch}, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71\% and 10.05\% on the polyp segmentation task with 5\% and 10\% scribble annotations, respectively. The code will be released at https://github.com/Senyh/SynMatch.
Chinese Summary: SynMatch通过合成与伪标签相匹配的图像来解决医学图像分割中的标签稀缺问题,无需额外训练参数,在多种标注受限场景下均实现了卓越性能。
English Summary: SynMatch addresses label scarcity in medical image segmentation by synthesizing images to match pseudo labels, achieving superior performance across various annotation-limited settings without additional training parameters.

Authors:Xiang Xiang, Qinhao Zhou, Zhuo Xu, Jing Ma, Jiaxin Dai, Yifan Liang, Hanlin Li
Title: OpenHAIV: A Framework Towards Practical Open-World Learning
Abstract:
Substantial progress has been made in various techniques for open-world recognition. Out-of-distribution (OOD) detection methods can effectively distinguish between known and unknown classes in the data, while incremental learning enables continuous model knowledge updates. However, in open-world scenarios, these approaches still face limitations. Relying solely on OOD detection does not facilitate knowledge updates in the model, and incremental fine-tuning typically requires supervised conditions, which significantly deviate from open-world settings. To address these challenges, this paper proposes OpenHAIV, a novel framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a unified pipeline. This framework allows models to autonomously acquire and update knowledge in open-world environments. The proposed framework is available at https://haiv-lab.github.io/openhaiv .

Authors:Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu, Ming Lu, Yuanxing Zhang, Wenhui Dong, Wentao Zhang
Title: Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM
Abstract:
Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information, we develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at https://github.com/Hhankyangg/SLC.
中文: 本文提出了一种训练高效的“大小模型协作”(SLC)框架,通过小型视觉语言模型生成个性化信息、大型模型整合信息实现精准响应,为视觉语言模型的现实个性化应用开辟了新途径。
English: This paper introduces a training-efficient Small-Large Collaboration (SLC) framework where small VLMs generate personalized information and large VLMs integrate it for accurate responses, enabling broader real-world personalization of vision-language models.

Authors:Fengchao Xiong, Zhenxing Wu, Sen Jia, Yuntao Qian
Title: SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking
Abstract:
Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
中文摘要:本文提出了一种新颖的高光谱视频跟踪方法,通过基于Transformer的空间关系建模和光谱损失函数实现材料分布对齐,有效提升了跟踪性能,达到了当前最优水平。
English Summary: This paper proposes a novel hyperspectral video tracking method that enhances performance by modeling spectral interactions through Transformer-based spatial relationships and a spectral loss for material alignment, achieving state-of-the-art results.
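One possible reading of the spectral loss is distribution matching between material responses pooled over the template and the predicted region, as in the hedged sketch below; the pooling scheme and the KL divergence choice are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a spectral loss enforcing material-distribution alignment
# between the template and the predicted region.
import torch.nn.functional as F

def spectral_loss(template_resp, pred_resp):
    """*_resp: (M, H, W) per-pixel responses over M material classes."""
    p = template_resp.mean(dim=(1, 2)).softmax(dim=0)   # template distribution
    q = pred_resp.mean(dim=(1, 2)).softmax(dim=0)       # predicted-region distribution
    return F.kl_div(q.log(), p, reduction="sum")        # KL(p || q)
```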

Authors:Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Title: Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers
Abstract:
Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
中文摘要:提出的MiraMo框架通过采用高效线性注意力降低计算开销、运动残差学习范式提升时序一致性,以及噪声优化策略抑制运动伪影,显著提升了图像动画的流畅度与可控性,实验证明其综合性能优于现有方法。
English Summary: The proposed MiraMo framework enhances image animation by integrating efficient linear attention for reduced computational cost, a motion residual learning paradigm for improved temporal consistency, and a noise refinement strategy to suppress motion artifacts, achieving superior performance in smoothness and controllability.
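The linear-attention substitution can be illustrated with a generic kernelized attention block (Katharopoulos-style), which reduces the cost from quadratic to linear in the token count; this is a standard formulation, not MiraMo's exact layer.

```python
# Toy linear-attention block: the softmax is replaced by a positive feature
# map so attention costs O(N) in sequence length instead of O(N^2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, D)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, N, self.heads, self.dk)
        q, k, v = (t.reshape(shape).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1                  # positive feature map
        kv = torch.einsum("bhnd,bhne->bhde", k, v)         # (B, H, dk, dk)
        z = 1.0 / (q * k.sum(dim=2, keepdim=True)).sum(-1, keepdim=True).clamp(min=1e-6)
        out = torch.einsum("bhnd,bhde->bhne", q, kv) * z   # (B, H, N, dk)
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```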

Authors:Bo Wang, Mengyuan Xu, Yue Yan, Yuqun Yang, Kechen Shu, Wei Ping, Xu Tang, Wei Jiang, Zheng You
Title: ASM-UNet: Adaptive Scan Mamba Integrating Group Commonalities and Individual Variations for Fine-Grained Segmentation
Abstract:
Precise lesion resection depends on accurately identifying fine-grained anatomical structures. While many coarse-grained segmentation (CGS) methods have been successful in large-scale segmentation (e.g., organs), they fall short in clinical scenarios requiring fine-grained segmentation (FGS), which remains challenging due to frequent individual variations in small-scale anatomical structures. Although recent Mamba-based models have advanced medical image segmentation, they often rely on fixed manually-defined scanning orders, which limit their adaptability to individual variations in FGS. To address this, we propose ASM-UNet, a novel Mamba-based architecture for FGS. It introduces adaptive scan scores to dynamically guide the scanning order, generated by combining group-level commonalities and individual-level variations. Experiments on two public datasets (ACDC and Synapse) and a newly proposed challenging biliary tract FGS dataset, namely BTMS, demonstrate that ASM-UNet achieves superior performance in both CGS and FGS tasks. Our code and dataset are available at https://github.com/YqunYang/ASM-UNet.
中文: ASM-UNet通过自适应扫描评分动态调整扫描顺序,在多个医疗数据集上实现了粗粒度与细粒度分割任务的卓越性能。
English: ASM-UNet introduces adaptive scan scores to dynamically guide scanning orders, achieving superior performance in both coarse-grained and fine-grained segmentation tasks across multiple medical datasets.
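The adaptive scan idea can be sketched as score-guided token reordering before a state-space scan; the linear score head below is a stand-in, whereas ASM-UNet derives its scores by combining group-level commonalities with individual-level variations.

```python
# Toy sketch of score-guided token reordering before a Mamba-style scan.
import torch
import torch.nn as nn

class AdaptiveScan(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)   # per-token scan score (assumed)

    def forward(self, tokens):                # tokens: (B, N, C)
        scores = self.score_head(tokens).squeeze(-1)           # (B, N)
        order = scores.argsort(dim=1)                          # adaptive order
        sorted_tokens = torch.gather(
            tokens, 1, order.unsqueeze(-1).expand_as(tokens))
        inverse = order.argsort(dim=1)        # to restore the original layout
        return sorted_tokens, inverse
```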

Authors:Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu
Title: EventRR: Event Referential Reasoning for Referring Video Object Segmentation
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment out the object in a video referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting their crucial semantic structure essential for referent reasoning. Besides, in contrast to image-referring expressions whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic eventful structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of the REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from REG leaf nodes to the root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in the REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at https://github.com/bio-mlhui/EventRR
中文:本文提出的EventRR框架通过将任务分解为对象摘要和指代推理,利用指代事件图构建语义结构,有效解决了视频指代对象分割中表达复杂性的挑战,在多个基准测试中超越了现有最优方法。
English: The proposed EventRR framework addresses the limitations of current Referring Video Object Segmentation methods by decoupling the task into object summarization and referential reasoning, utilizing a Referential Event Graph to structure expressions and outperforming state-of-the-art approaches across multiple benchmarks.

Authors:Yunpeng Shi, Lei Chen, Xiaolu Shen, Yanju Guo
Title: Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection
Abstract:
In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at https://github.com/Shi-Yun-peng/LMFNet
中文: 本文提出LMFNet轻量级网络,通过新型多尺度特征提取层在显著目标检测中仅用0.81M参数就实现最优性能,成功解决了轻量网络中效率与精度的平衡难题。
English: This paper introduces LMFNet, a lightweight network using novel multi-scale layers that achieve state-of-the-art salient object detection with minimal parameters while maintaining high efficiency and accuracy.
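An LMF-style layer can be sketched as several depthwise separable convolutions with different dilation rates whose outputs are densely connected by summation; the branch count, dilation rates, and dense-summation wiring are assumptions for illustration, not the released LMFNet definition.

```python
# Hedged sketch of a lightweight multi-scale (LMF-style) layer built from
# depthwise separable dilated convolutions in a densely connected structure.
import torch.nn as nn

class LMFLayer(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),       # depthwise, dilated
                nn.Conv2d(channels, channels, 1, bias=False),  # pointwise
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )

    def forward(self, x):
        outputs = []
        for branch in self.branches:
            # Each branch sees the input plus every earlier branch output,
            # giving the (assumed) fully connected multi-scale structure.
            inp = x + sum(outputs) if outputs else x
            outputs.append(branch(inp))
        return x + sum(outputs)
```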

Authors:Yingtie Lei, Fanghai Yi, Yihang Dong, Weihuang Liu, Xiaofeng Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen
Title: CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance
Abstract:
Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at~\href{https://github.com/CXH-Research/CMAMRNet}{https://github.com/CXH-Research/CMAMRNet}.
中文摘要:CMAMRNet提出了一种创新的上下文掩码感知网络,通过专用组件实现持续掩码引导和多尺度特征提取,在保持壁画艺术真实性和结构完整性方面优于现有方法。
English Summary: CMAMRNet introduces a novel contextual mask-aware network with dedicated components for consistent mask guidance and multi-scale feature extraction, outperforming existing methods in preserving mural authenticity and structural details.

Authors:Yuke Xing, William Gordon, Qi Yang, Kaifa Yang, Jiarui Wang, Yiling Xu
Title: 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression
Abstract:
3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at https://github.com/YukeXing/3DGS-VBench.
中文: 3DGS-VBench建立了首个针对3D高斯溅射压缩算法的视频质量评估基准,通过包含660个压缩模型的数据集填补了系统性质量分析空白,为压缩技术与质量评估研究提供了重要支撑。
English: 3DGS-VBench introduces a comprehensive video quality assessment benchmark for evaluating compression algorithms in 3D Gaussian Splatting, addressing the lack of systematic quality analysis while enabling specialized model development through its open dataset.
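For readers re-implementing the evaluation, a common way to compute MOS with outlier removal (as the benchmark describes for its 50 participants) looks like the sketch below; the z-score screening rule is an assumption, since the exact protocol is not spelled out in the abstract.

```python
import numpy as np

def mos_with_outlier_removal(ratings, z_thresh=2.0):
    """Average a sequence's subjective ratings after dropping outliers.
    The z-score rule is an assumption; the benchmark may use a different
    screening protocol (e.g. an ITU-style subject rejection step)."""
    ratings = np.asarray(ratings, dtype=float)
    z = (ratings - ratings.mean()) / (ratings.std() + 1e-8)
    return ratings[np.abs(z) < z_thresh].mean()

print(mos_with_outlier_removal([4, 5, 4, 4, 1, 5]))  # the stray "1" is discarded
```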

Authors:Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin, Ming Hu, Ziyan Huang, Ying Chen, Chenglong Ma, Tianbin Li, Lihao Liu, Junjun He, Lei Zhu
Title: S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision
Abstract:
Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between each training epoch. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these issues, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg
Chinese: 提出的S2-UniSeg模型采用快速通用聚合池化算法,实现了连续自监督预训练,并在多个分割基准测试中显著超越了现有最优方法。
English: The proposed S2-UniSeg model with Fast Universal Agglomerative Pooling enables continuous self-supervised pretraining and outperforms state-of-the-art methods across multiple segmentation benchmarks.
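A toy illustration of the "identify groups of similar nodes in parallel" step behind UniAP: connect nodes whose cosine similarity exceeds a threshold and average-pool each connected group. This is only a conceptual sketch; the real layers are multi-granular and heavily parallelized.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def agglomerative_pool_step(feats, threshold=0.9):
    """One conceptual pooling step: group nodes with high cosine similarity and
    average-pool each group. Threshold and grouping rule are assumptions."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    adj = csr_matrix(f @ f.T > threshold)
    n_groups, labels = connected_components(adj, directed=False)
    pooled = np.stack([feats[labels == g].mean(axis=0) for g in range(n_groups)])
    return pooled, labels

feats = np.random.randn(100, 16).astype(np.float32)
pooled, labels = agglomerative_pool_step(feats)
print(pooled.shape, labels.shape)  # (num_groups, 16) (100,)
```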

Authors:Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu, Chao Shen
Title: Adversarial Video Promotion Against Text-to-Video Retrieval
Abstract:
Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over $30/10/4\%$ for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.
Chinese: 本研究首次提出针对文本到视频检索的对抗性攻击方法ViPro,通过模态细化增强跨模态交互来提升视频排名,在多种设置下均表现出优越性能,揭示了检索系统中的关键漏洞。
English: This study introduces ViPro, the first adversarial attack method for text-to-video retrieval that promotes video rankings by enhancing cross-modal interactions through Modal Refinement, demonstrating superior performance across various settings and highlighting a critical vulnerability in retrieval systems.
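To show the direction of a promotion attack, here is a generic PGD-style sketch that nudges a video so its embedding moves closer to several target query embeddings at once. The `video_encoder` name, step sizes, and budget are assumptions, and ViPro's Modal Refinement and transferability machinery are not reproduced.

```python
import torch
import torch.nn.functional as F

def promote_video_sketch(video, video_encoder, query_embs,
                         steps=10, eps=8 / 255, alpha=1 / 255):
    """Generic multi-target promotion sketch: gradient ascent on the cosine
    similarity between the perturbed video's embedding and target query
    embeddings, under an L-inf budget. Not the authors' ViPro algorithm."""
    delta = torch.zeros_like(video, requires_grad=True)
    for _ in range(steps):
        emb = video_encoder(video + delta)                       # (D,)
        sim = F.cosine_similarity(emb.unsqueeze(0), query_embs)  # (num_queries,)
        sim.mean().backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend similarity
            delta.clamp_(-eps, eps)              # stay inside the L-inf budget
            delta.grad.zero_()
    return (video + delta).detach()              # clamp to valid pixel range in practice
```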

Authors:Lixuan He, Jie Feng, Yong Li
Title: AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Abstract:
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of \textbf{implicit rewards}, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce \textbf{Adaptive Meta Fine-Tuning (AMFT)}, a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a \textbf{meta-gradient adaptive weight controller} that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamics analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our code is open-sourced at https://github.com/hlxtsyj/AMFT.
中文: 本文提出自适应元微调(AMFT)算法,通过元梯度控制器动态平衡监督微调与强化学习,在多项推理任务中实现了最优性能并展现出卓越的泛化能力。
English: This paper introduces Adaptive Meta Fine-Tuning (AMFT), a single-stage algorithm that dynamically balances supervised fine-tuning and reinforcement learning through meta-gradient control to achieve state-of-the-art performance across multiple reasoning tasks.
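The paper's core object is a learnable weight between the SFT and RL objectives. As a rough illustration of how such an entropy-regularized weighted objective could be assembled (the meta-gradient update of the weight itself, which is the actual contribution, is omitted):

```python
import torch
import torch.nn as nn

class BalanceControllerSketch(nn.Module):
    """Toy stand-in for the SFT/RL balance: a single logit mapped to a weight
    in (0, 1). AMFT updates this weight with a meta-gradient on long-term task
    performance; that outer update is omitted, so this only shows how the
    entropy-regularized weighted objective is formed."""
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(()))

    def forward(self, sft_loss, rl_loss, policy_entropy, beta=0.01):
        lam = torch.sigmoid(self.logit)
        # lam * imitation + (1 - lam) * exploration, regularized by entropy
        return lam * sft_loss + (1 - lam) * rl_loss - beta * policy_entropy
```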

Authors:Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, Guorui Zhou
Title: AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
Abstract:
Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models' outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: https://github.com/Kwai-Klear/AR-GRPO.
中文: AR-GRPO方法将在线强化学习与自回归图像生成模型相结合,通过精心设计的奖励函数在多维度评估生成图像,显著提升了图像质量和人类偏好度。
English: The AR-GRPO approach integrates online reinforcement learning with autoregressive image generation models, using tailored reward functions to significantly enhance image quality and human preference across multiple tasks.
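GRPO's central quantity is a group-relative advantage: each candidate image's reward is normalized against the other candidates generated for the same prompt. A minimal version, with the reward functions and the policy update omitted:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each candidate's reward by the mean and
    std of the group generated for the same prompt. `rewards` has shape
    (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9]])   # one prompt, four sampled images
print(group_relative_advantages(rewards))
```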

Authors:Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
Title: MultiRef: Controllable Image Generation with Multiple Visual References
Abstract:
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.
中文: 本文提出MultiRef-bench多参考图像生成评估框架,发现即使最先进的模型也难以有效整合多个视觉参考,最优系统在合成和真实样本中仅分别达到66.6%和79.0%的准确率。
English: This paper introduces MultiRef-bench, a comprehensive evaluation framework for multi-reference image generation, revealing that even advanced models struggle to effectively integrate multiple visual references, with the top-performing system achieving only 66.6-79.0% accuracy compared to ideal outputs.

Authors:Zihao Sheng, Zilin Huang, Yen-Jung Chen, Yansong Qu, Yuhao Luo, Yue Leng, Sikai Chen
Title: SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Abstract:
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG

Authors:Yash Garg, Saketh Bachu, Arindam Dutta, Rohit Lal, Sarosij Bose, Calvin-Khang Ta, M. Salman Asif, Amit Roy-Chowdhury
Title: VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
Abstract:
Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the Project page for code and dataset:https://yashgarg98.github.io/VOccl3D-dataset/

Authors:Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Title: MMFformer: Multimodal Fusion Transformer Network for Depression Detection
Abstract:
Depression is a serious mental health illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, early diagnosis of depression from social network content has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. A transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to model important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
中文: 本文提出的MMFformer多模态网络通过从社交媒体数据中提取时空特征来检测抑郁,在基准数据集上的表现显著优于现有方法。
English: This paper introduces MMFformer, a multimodal network that effectively detects depression by extracting spatio-temporal patterns from social media data, significantly outperforming existing methods on benchmark datasets.

Authors:Zheyuan Zhang, Weihao Tang, Hong Chen
Title: Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors
Abstract:
Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Empirical experiments have demonstrated that on popular ME benchmarks, the CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has surpassed state-of-the-art (SOTA) methods on several standard MER benchmarks when using the provided annotated key-frames. Code is available at https://github.com/tony19980810/CausalNet.
Chinese: 本文提出CausalNet框架,通过处理完整微表情序列并采用因果学习模块聚焦相关肌肉运动,在保持识别精度的同时实现了对关键帧索引误差具有鲁棒性的微表情识别。
English: The paper introduces CausalNet, a robust framework for micro-expression recognition that maintains accuracy despite key-frame index errors by processing full sequences and using causal learning modules to focus on relevant muscle movements.

Authors:Guanyu Hu, Dimitrios Kollias, Xinyu Yang
Title: Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC
Abstract:
Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
中文摘要:本文提出视觉情感引导锚定(VEGA)机制,利用CLIP图像编码器构建情感特异性视觉锚点,通过心理学对齐表征提升多模态情感识别性能,在基准数据集上达到最优效果。
English Summary: This paper introduces the Visual Emotion Guided Anchoring (VEGA) mechanism that leverages CLIP's image encoder to create emotion-specific visual anchors, enhancing multimodal emotion recognition through psychologically aligned representations and achieving state-of-the-art performance on benchmark datasets.
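A sketch of the anchor-construction step using the Hugging Face CLIP API: embed a few facial exemplars per emotion with the image encoder and average them into a normalized prototype. The checkpoint name is an assumption, and VEGA's stochastic anchor sampling and downstream fusion are not shown.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def build_visual_anchors(exemplars_by_emotion):
    """Embed a handful of facial exemplars per emotion with CLIP's image
    encoder and average them into one normalized prototype per class.
    Sketch only; not the released VEGA code."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    anchors = {}
    for emotion, images in exemplars_by_emotion.items():  # images: list of PIL images
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        anchors[emotion] = feats.mean(dim=0)               # one prototype per class
    return anchors
```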

Authors:Unisha Joshi
Title: Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection
Abstract:
The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in deepfake datasets remains largely unaddressed. This paper focuses on mitigating age-specific bias in deepfake datasets by introducing an age-diverse deepfake dataset that improves fairness across age groups. The dataset is constructed through a modular pipeline that incorporates the existing Celeb-DF, FaceForensics++, and UTKFace datasets and creates synthetic data to fill the age distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at https://github.com/unishajoshi/age-diverse-deepfake-detection.
中文: 本文提出一个年龄多样化的深度伪造数据集以解决检测模型中的群体偏见,通过全面评估证明该数据集能提高跨年龄组的检测公平性、准确性和泛化能力。
English: This paper introduces an age-diverse deepfake dataset to address demographic bias in detection models, demonstrating improved fairness, accuracy, and generalization across age groups through comprehensive evaluations.
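Since the evaluation leans on AUC, pAUC, and EER, here is a standard EER computation (not taken from the released pipeline) that can be applied per age group to probe the fairness gains the paper reports.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-positive and false-negative rates
    cross. Computing it per age group is one simple way to compare fairness."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)
```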

Authors:Qi Xun Yeo, Yanyan Li, Gim Hee Lee
Title: Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
Abstract:
Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and through neighboring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighboring node information to aid robustness of our scene graph estimates. Furthermore, we leverage on explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.

Authors:Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, Masashi Sugiyama
Title: What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?
Abstract:
Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from the unavoidable object hallucination issue, tending to generate objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, \textit{e.g.}, category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (\textit{i.e.}, non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9\% and up to 23\% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at https://github.com/xiemk/HOPE.
中文: HOPE基准通过内容感知和基于描述的搜索生成误导性干扰项,严格评估大型视觉语言模型的物体幻觉问题,在揭示模型缺陷方面显著优于现有的POPE基准。
English: The HOPE benchmark is introduced to rigorously assess object hallucination in Large Vision-Language Models by generating misleading distractors through content-aware and description-based searching, significantly outperforming the existing POPE benchmark in exposing model vulnerabilities.
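The content-aware search boils down to ranking objects that are absent from the image by how plausible CLIP finds them, then using the top-ranked ones as distractors. A simplified sketch (the prompt template and checkpoint are assumptions; the description-based branch and the LVLM probing loop are omitted):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def select_distractors(image, absent_objects, top_k=3):
    """Rank objects NOT present in the image by CLIP image-text similarity and
    keep the most plausible ones as hallucination distractors. Sketch only."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[f"a photo of a {o}" for o in absent_objects],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]   # similarity to each prompt
    ranked = scores.argsort(descending=True)[:top_k]
    return [absent_objects[i] for i in ranked]
```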

Authors:Jiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto, Ning Zhang
Title: RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving
Abstract:
Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.
中文: RMT-PPAD是一种基于Transformer的实时多任务模型,在BDD100K数据集上实现了目标检测、可行驶区域分割和车道线分割的最优性能,同时保持了高效的推理速度。
English: RMT-PPAD is a real-time transformer-based multi-task model that achieves state-of-the-art performance in object detection, drivable area segmentation, and lane line segmentation on the BDD100K dataset while maintaining high inference speed.
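To illustrate the gate-control-with-adapter idea, here is a hypothetical fusion module in which each task head learns how much of the shared feature to blend with its task-specific feature, one simple way to soften negative transfer. The structure and names are assumptions, not the released module.

```python
import torch
import torch.nn as nn

class GatedTaskAdapterSketch(nn.Module):
    """Illustrative gate-controlled adapter mixing shared and task-specific
    features; the real RMT-PPAD module likely differs in detail."""
    def __init__(self, channels):
        super().__init__()
        self.adapter = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                     nn.ReLU(inplace=True))
        self.gate = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, shared_feat, task_feat):
        g = torch.sigmoid(self.gate(torch.cat([shared_feat, task_feat], dim=1)))
        return g * self.adapter(shared_feat) + (1 - g) * task_feat
```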

Authors:He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su, Xiangqian Wu
Title: DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation
Abstract:
Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods primarily focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style-controllable portrait animation. We design a Style-Emotion Encoding Module that employs two separate branches: a style branch extracting identity-specific style information (e.g., head poses and movements), and an emotion branch extracting identity-agnostic emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance the quality of results, we adopt and modify two optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity and background details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability. Project Page: https://thenameishope.github.io/DiTalker/

Authors:Rakesh Raj Madavan, Akshat Kaimal, Hashim Faisal, Chandrakala S
Title: Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG
Abstract:
An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: https://github.com/Rakesh-123-cryp/Med-GRIM.git
中文: BIND模型通过密集编码优化多模态编码器的联合嵌入空间,而Med-GRIM利用此技术,结合基于图的检索和提示工程,采用小型语言模型高效处理医学视觉问答任务,无需大量微调即可实现精准响应。
English: The BIND model enhances multimodal encoders with dense encoding to improve joint embedding spaces, while Med-GRIM leverages this for medical VQA by integrating graph-based retrieval and prompt engineering with small language models, achieving high efficiency and accuracy without extensive fine-tuning.

Authors:Yehonathan Litman, Fernando De la Torre, Shubham Tulsiani
Title: LightSwitch: Multi-view Relighting with Material-guided Diffusion
Abstract:
Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that directly relight from an input image do not take advantage of intrinsic properties of the subject that could be inferred, nor can they consider multi-view data at scale, leading to subpar relighting. In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes.

Authors:Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
Title: Effective Training Data Synthesis for Improving MLLM Chart Understanding
Abstract:
Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.
中文摘要:本研究提出了一种模块化且多样化的数据合成流程,创建了有效图表数据集(ECD),显著提升了多模态大语言模型在各类真实与合成测试集上的图表理解能力。
English Summary: This study introduces a modular and diversified data synthesis pipeline to create the Effective Chart Dataset (ECD), which significantly enhances the chart understanding capabilities of multimodal large language models across various real-world and synthetic benchmarks.

Authors:Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Title: WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion
Abstract:
Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.
中文: 本研究提出了WGAST,首个端到端深度学习框架,通过弱监督生成网络融合多卫星数据,实现了10米分辨率的日地表温度估算,在环境监测中展现出卓越的精度和鲁棒性。
English: This study introduces WGAST, the first end-to-end deep learning framework that uses a weakly-supervised generative network to estimate daily 10-meter resolution land surface temperature by fusing data from multiple satellites, achieving superior accuracy and robustness in environmental monitoring.
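The weakly supervised signal in WGAST rests on physical averaging: fine-resolution LST, average-pooled to the coarse footprint, should reproduce the coarse observation. A minimal PyTorch version of such a term (the adversarial and attention components are omitted; the scale factor depends on the sensor pair and is an assumption here):

```python
import torch.nn.functional as F

def physical_averaging_loss(pred_lst_10m, coarse_lst, scale):
    """Fine-resolution LST, average-pooled by `scale`, should match the coarse
    observation over each footprint. Illustrative weak-supervision term only."""
    pooled = F.avg_pool2d(pred_lst_10m, kernel_size=scale)
    return F.l1_loss(pooled, coarse_lst)
```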

Authors:Ruida Cheng, Tejas Sudharshan Mathai, Pritam Mukherjee, Benjamin Hou, Qingqing Zhu, Zhiyong Lu, Matthew McAuliffe, Ronald M. Summers
Title: Text Embedded Swin-UMamba for DeepLesion Segmentation
Abstract:
Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow offers the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, a high Dice score of 82% and a low Hausdorff distance of 6.58 pixels were obtained for lesion segmentation. The proposed Text-Swin-UMamba model outperformed prior approaches: a 37% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and it surpassed the purely image-based xLSTM-UNet and nnUNet models by 1.74% and 0.22%, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba
中文: 将大型语言模型与Swin-UMamba架构结合用于CT病灶分割,以82%的Dice分数显著超越现有方法,展现出卓越性能。
English: Integrating large language models with the Swin-UMamba architecture for lesion segmentation on CT scans achieves superior performance, significantly outperforming previous methods with an 82% Dice Score.
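For reference, the Dice score quoted above is computed as follows on binary masks (a standard formulation, not code taken from the repository):

```python
import torch

def dice_score(pred_mask, gt_mask, eps=1e-6):
    """Dice coefficient on binary masks: 2*|A∩B| / (|A|+|B|)."""
    pred = pred_mask.float().flatten()
    gt = gt_mask.float().flatten()
    intersection = (pred * gt).sum()
    return (2 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```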

Authors:Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li
Title: CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
Abstract:
Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
中文:提出的CLIPin框架通过非对比插件和共享预投影器,增强了CLIP类模型的多模态语义对齐能力,在不同任务中提升了鲁棒性和泛化性能。
English: The proposed CLIPin framework enhances multimodal semantic alignment in CLIP-style models through a non-contrastive plug-in and shared pre-projectors, improving robustness and generalization across diverse tasks.
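One way to picture "contrastive plus non-contrastive" training is a loss that adds a simple pair-alignment term, which needs no negatives, to the usual CLIP objective. The weighting and the alignment term are assumptions; CLIPin's shared pre-projectors and exact plug-in design are not modeled.

```python
import torch
import torch.nn.functional as F

def clip_plus_noncontrastive_loss(img_emb, txt_emb, temperature=0.07, alpha=0.5):
    """Symmetric CLIP InfoNCE loss plus a cosine-alignment term on matched
    image-text pairs. Illustrative combination only."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    noncontrastive = (1 - (img * txt).sum(dim=-1)).mean()  # pull matched pairs together
    return contrastive + alpha * noncontrastive
```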

Authors:Guido Manni, Clemente Lauretti, Loredana Zollo, Paolo Soda
Title: SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation
Abstract:
Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks -- a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier -- within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. Code is available at https://github.com/GuidoManni/SPARSE.
中文: 本文提出了一种基于GAN的半监督学习框架,通过整合图像翻译和集成伪标记技术,有效解决了医学影像中标注数据稀缺的难题,在每类仅五个标注样本的极端条件下仍能实现卓越的分类性能。
English: This paper presents a GAN-based semi-supervised learning framework that effectively addresses the challenge of limited labeled data in medical imaging by integrating image translation and ensemble pseudo-labeling, achieving superior performance across multiple datasets with as few as five labeled samples per class.
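A compact sketch of the pseudo-labeling step: average the discriminator's and classifier's class probabilities, smooth them with an exponential moving average for temporal consistency, and keep only confident predictions. The equal weighting, momentum, and threshold are illustrative assumptions rather than the paper's settings.

```python
import torch

def ensemble_pseudo_labels(disc_probs, clf_probs, ema_probs,
                           momentum=0.99, threshold=0.9):
    """Fuse two prediction heads, smooth over time with an EMA, and keep only
    confident pseudo-labels for the unlabeled images. Sketch only."""
    fused = (disc_probs + clf_probs) / 2
    ema_probs = momentum * ema_probs + (1 - momentum) * fused
    confidence, labels = ema_probs.max(dim=1)
    keep = confidence > threshold
    return labels[keep], keep, ema_probs
```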

Authors:Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, Jingkuan Song
Title: Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation
Abstract:
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning -- the reliance on task-irrelevant features -- as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $π_0$, in both simulation and real-world environments. More information at https://lucky-light-sun.github.io/proj/shortcut-learning-in-grps/.
中文: 基于大规模数据集(如Open X-Embodiment)训练的通用机器人策略因捷径学习而泛化能力受限,其根源在于子数据集多样性不足和分布差异,但通过优化数据采集或针对性增强策略可有效改善。
English: Generalist robot policies trained on large datasets like Open X-Embodiment often fail to generalize due to shortcut learning, which stems from limited sub-dataset diversity and distributional disparities, but this can be mitigated through improved data collection or targeted augmentation strategies.

Authors:Xiangyu Wu, Feng Yu, Yang Yang, Jianfeng Lu
Title: Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning
Abstract:
The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.
中文摘要:TaAM-CPT提出了一种仅使用文本数据即可构建适用于无限模态的通用表征模型的可扩展方法,无需任何模态特定标注数据,便在视频、图像和音频分类等多种任务中取得了领先性能。
English Summary: TaAM-CPT introduces a scalable method that uses only text data to create a general representation model for unlimited modalities, achieving leading results across video, image, and audio classification without requiring modality-specific labeled data.

Authors:Zelin Li, Ruohan Zong, Yifan Liu, Ruichen Yao, Yaokun Liu, Yang Zhang, Dong Wang
Title: Anti-Tamper Protection for Unauthorized Individual Image Generation
Abstract:
With the advancement of personalized image generation technologies, concerns about forgery attacks that infringe on portrait rights and privacy are growing. To address these concerns, protection perturbation algorithms have been developed to disrupt forgery generation. However, the protection algorithms would become ineffective when forgery attackers apply purification techniques to bypass the protection. To address this issue, we present a novel approach, Anti-Tamper Perturbation (ATP). ATP introduces a tamper-proof mechanism within the perturbation. It consists of protection and authorization perturbations, where the protection perturbation defends against forgery attacks, while the authorization perturbation detects purification-based tampering. Both protection and authorization perturbations are applied in the frequency domain under the guidance of a mask, ensuring that the protection perturbation does not disrupt the authorization perturbation. This design also enables the authorization perturbation to be distributed across all image pixels, preserving its sensitivity to purification-based tampering. ATP demonstrates its effectiveness in defending forgery attacks across various attack settings through extensive experiments, providing a robust solution for protecting individuals' portrait rights and privacy. Our code is available at: https://github.com/Seeyn/Anti-Tamper-Perturbation .
中文: 提出的抗篡改扰动(ATP)方法在频域中结合保护性扰动和授权性扰动,既能防御伪造攻击,又能检测基于净化的篡改行为,为保护肖像权和隐私提供了可靠方案。
English: The proposed Anti-Tamper Perturbation (ATP) method combines protection and authorization perturbations in the frequency domain to defend against forgery attacks while detecting purification-based tampering, offering a robust solution for safeguarding portrait rights and privacy.

Authors:Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
Title: SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Abstract:
Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method. Code: https://github.com/zhangquanchen/SIFThinker.
中文: 当前多模态大语言模型在复杂视觉任务中仍面临挑战,而SIFThinker提出了一种空间感知框架,通过深度增强边界框和自然语言动态校正注意力并聚焦相关区域,在空间理解和细粒度感知方面超越了现有最优方法。
English: Current multimodal large language models struggle with complex visual tasks, but SIFThinker introduces a spatially-aware framework that uses depth-enhanced bounding boxes and natural language to dynamically correct attention and focus on relevant regions, outperforming state-of-the-art methods in spatial understanding and fine-grained perception.

Authors:Md Sazidur Rahman, David Cabecinhas, Ricard Marxer
Title: Depth Jitter: Seeing through the Depth
Abstract:
Depth information is essential in computer vision, particularly in underwater imaging, robotics, and autonomous navigation. However, conventional augmentation techniques overlook depth-aware transformations, limiting model robustness to real-world depth variations. In this paper, we introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on model stability under diverse depth conditions. Extensive experiments compare Depth-Jitter against traditional augmentation strategies such as ColorJitter, analyzing performance across varying learning rates, encoders, and loss functions. While Depth-Jitter does not always outperform conventional methods in absolute performance, it consistently enhances model stability and generalization in depth-sensitive environments. These findings highlight the potential of depth-aware augmentation for real-world applications and provide a foundation for further research into depth-based learning strategies. The code is publicly available on \href{https://github.com/mim-team/Depth-Jitter}{GitHub} to support advancements in depth-aware augmentation.
中文: 本文提出Depth-Jitter这一基于深度的数据增强技术,通过模拟自然深度变化来提升模型在深度敏感应用中的鲁棒性和泛化能力,在不同条件下持续增强模型稳定性。
English: This paper introduces Depth-Jitter, a depth-based augmentation technique that simulates natural depth variations to enhance model robustness and generalization in depth-sensitive applications, consistently improving stability across diverse conditions.
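One plausible reading of "adaptive depth offsetting guided by depth variance thresholds" is sketched below: apply a random global offset only when the depth statistics suggest structure will be preserved. The gating rule, threshold, and offset range are assumptions, not the released defaults, so treat this purely as an illustration.

```python
import numpy as np

def depth_jitter(depth_map, var_threshold=0.05, max_offset=0.5, rng=None):
    """Variance-gated depth offsetting: jitter the depth map globally only when
    its variance is low enough that the perturbation preserves structure."""
    rng = rng or np.random.default_rng()
    if depth_map.var() < var_threshold:
        return depth_map + rng.uniform(-max_offset, max_offset)
    return depth_map
```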

Authors:Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma
Title: Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model
Abstract:
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack the Chain-of-Thought(CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released on https://github.com/hq-King/Affordance-R1.
中文: Affordance-R1是首个融合思维链推理的强化学习框架,通过群体相对策略优化提升机器人对物体可操作区域的识别能力,在无需显式推理数据的情况下实现了强大的零样本泛化和涌现推理性能。
English: Affordance-R1 is a novel reinforcement learning framework that integrates Chain-of-Thought reasoning to enhance robots' ability to identify actionable object regions, achieving superior zero-shot generalization and explicit reasoning without relying on pre-existing reasoning data.

Authors:Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan, Lin
Title: Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
Abstract:
Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at https://github.com/GATECH-EIC/PostDiff.
Chinese: This paper proposes PostDiff, a training-free framework that accelerates pretrained diffusion models by reducing input-level and module-level redundancy, showing that cutting per-step inference cost preserves generation quality more effectively than reducing the number of denoising steps.
English: This paper introduces PostDiff, a training-free framework that accelerates pre-trained diffusion models by reducing input-level and module-level redundancy, demonstrating that lowering per-step inference cost is more effective than reducing denoising steps for maintaining generation fidelity.
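A minimal sketch of the mixed-resolution idea: run the early denoising steps at half resolution (strengthening low-frequency structure), then upsample the latent and finish at full resolution. The `model` callable, the update rule, and the 50% switch point are stand-ins for a real scheduler, not PostDiff's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mixed_resolution_denoise(model, steps, latent_hw=(64, 64), low_frac=0.5, channels=4):
    """Early steps at half resolution, then upsample and finish at full resolution."""
    h, w = latent_hw
    x = torch.randn(1, channels, h // 2, w // 2)
    n_low = int(len(steps) * low_frac)
    for i, t in enumerate(steps):
        if i == n_low:  # switch to full resolution for the remaining steps
            x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
        eps = model(x, t)   # stand-in noise predictor
        x = x - 0.1 * eps   # placeholder update; a real scheduler step goes here
    return x

# Toy usage with a dummy predictor
latents = mixed_resolution_denoise(lambda x, t: 0.1 * x, steps=range(20))
```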

Authors:Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, Chen Li
Title: Text-guided Visual Prompt DINO for Generic Segmentation
Abstract:
Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection in hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data & code are available at https://github.com/WeChatCV/WeVisionOne.
Chinese Summary: Prompt-DINO addresses key limitations in multimodal segmentation through an early fusion mechanism, order-aligned query selection, and a generative data engine, achieving state-of-the-art open-world detection performance while substantially expanding semantic coverage.
English Summary: Prompt-DINO introduces an early fusion mechanism, order-aligned query selection, and a generative data engine to overcome limitations in multimodal segmentation, achieving state-of-the-art open-world detection performance with enhanced semantic coverage.

Authors:Hanqing Wang, Yuan Tian, Mingyu Liu, Zhenhao Zhang, Xiangyang Zhu
Title: SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models
Abstract:
In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have garnered significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose \textbf{SDEval}, the \textit{first} safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at https://github.com/hq-King/SDEval.
Chinese: SDEval proposes the first dynamic safety evaluation framework for multimodal large language models, using text, image, and text-image dynamic strategies to generate adjustable benchmarks that expose model safety flaws and mitigate data contamination across multiple datasets.
English: SDEval introduces the first dynamic safety evaluation framework for Multimodal Large Language Models, employing text, image, and text-image dynamics to generate adjustable benchmarks that reveal safety limitations and mitigate data contamination across multiple datasets.

Authors:Shaohua Pan, Xinyu Yi, Yan Zhou, Weihua Jian, Yuan Zhang, Pengfei Wan, Feng Xu
Title: DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera
Abstract:
Combining sparse IMUs and a monocular camera is a new promising setting to perform real-time human motion capture. This paper proposes a diffusion-based solution to learn human motion priors and fuse the two modalities of signals together seamlessly in a unified framework. By carefully considering the characteristics of the two signals, the sequential visual information is treated as a whole and transformed into a condition embedding, while the inertial measurements are concatenated with the noisy body pose frame by frame to construct the sequential input for the diffusion model. First, we observe that visual information may be unavailable in some frames due to occlusions or subjects moving out of the camera view. Incorporating the sequential visual features as a whole into a single feature embedding is therefore robust to occasional degradation of the visual information in those frames. On the other hand, the IMU measurements are robust to occlusions and remain stable as long as signal transmission is reliable, so incorporating them frame by frame better exploits the temporal information in the system. Experiments demonstrate the effectiveness of the system design and its state-of-the-art pose estimation performance compared with previous works. Our code is available for research at https://shaohua-pan.github.io/diffcap-page.
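The two fusion routes above can be pictured as follows: pool the visual features over time into one condition embedding, and concatenate IMU readings with the noisy pose frame by frame. All dimensions and the mean-pooling choice are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DiffCapInputs(nn.Module):
    """Sketch of the two fusion routes: one sequence-level visual condition,
    per-frame IMU + noisy-pose concatenation for the diffusion model."""
    def __init__(self, vis_dim=512, cond_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, cond_dim)

    def forward(self, vis_feats, imu, noisy_pose):
        # vis_feats: (B, T, vis_dim) -- pooled into ONE condition embedding,
        # so frames with occluded or missing detections degrade gracefully.
        cond = self.vis_proj(vis_feats.mean(dim=1))    # (B, cond_dim)
        # imu: (B, T, imu_dim), noisy_pose: (B, T, pose_dim) -- fused per frame
        # to exploit the always-available temporal IMU signal.
        x = torch.cat([imu, noisy_pose], dim=-1)       # (B, T, imu_dim + pose_dim)
        return x, cond

x, cond = DiffCapInputs()(torch.randn(2, 60, 512),
                          torch.randn(2, 60, 72),
                          torch.randn(2, 60, 144))
```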

Authors:Michael Wehrli, Alicia Durrer, Paul Friedrich, Sidaty El Hadramy, Edwin Li, Luana Brahaj, Carol C. Hasler, Philippe C. Cattin
Title: Towards MR-Based Trochleoplasty Planning
Abstract:
To treat Trochlear Dysplasia (TD), current approaches rely mainly on low-resolution clinical Magnetic Resonance (MR) scans and surgical intuition. The surgeries are planned based on surgeons' experience, have limited adoption of minimally invasive techniques, and lead to inconsistent outcomes. We propose a pipeline that generates super-resolved, patient-specific 3D pseudo-healthy target morphologies from conventional clinical MR scans. First, we compute an isotropic super-resolved MR volume using an Implicit Neural Representation (INR). Next, we segment femur, tibia, patella, and fibula with a multi-label custom-trained network. Finally, we train a Wavelet Diffusion Model (WDM) to generate pseudo-healthy target morphologies of the trochlear region. In contrast to prior work producing pseudo-healthy low-resolution 3D MR images, our approach enables the generation of sub-millimeter resolved 3D shapes compatible for pre- and intraoperative use. These can serve as preoperative blueprints for reshaping the femoral groove while preserving the native patella articulation. Furthermore, and in contrast to other work, we do not require a CT for our pipeline, reducing the amount of radiation. We evaluated our approach on 25 TD patients, showing that our target morphologies significantly improve the sulcus angle (SA) and trochlear groove depth (TGD). The code and interactive visualization are available at https://wehrlimi.github.io/sr-3d-planning/.

Authors:Chao Hao, Zitong Yu, Xin Liu, Yuhao Wang, Weicheng Xie, Jingang Shi, Huanjing Yue, Jingyu Yang
Title: Distribution-Specific Learning for Joint Salient and Camouflaged Object Detection
Abstract:
Salient object detection (SOD) and camouflaged object detection (COD) are two closely related but distinct computer vision tasks. Although both are class-agnostic segmentation tasks that map from RGB space to binary space, the former aims to identify the most salient objects in the image, while the latter focuses on detecting perfectly camouflaged objects that blend into the background in the image. These two tasks exhibit strongly contradictory attributes. Previous works have mostly believed that joint learning of these two tasks would confuse the network, reducing its performance on both tasks. However, here we present an opposite perspective: with the correct approach to learning, the network can simultaneously possess the capability to find both salient and camouflaged objects, allowing both tasks to benefit from joint learning. We propose SCJoint, a joint learning scheme for SOD and COD tasks, assuming that the decoding processes of SOD and COD have different distribution characteristics. The key to our method is to learn the respective means and variances of the decoding processes for both tasks by inserting a minimal amount of task-specific learnable parameters within a fully shared network structure, thereby decoupling the contradictory attributes of the two tasks at a minimal cost. Furthermore, we propose a saliency-based sampling strategy (SBSS) to sample the training set of the SOD task to balance the training set sizes of the two tasks. In addition, SBSS improves the training set quality and shortens the training time. Based on the proposed SCJoint and SBSS, we train a powerful generalist network, named JoNet, which can simultaneously capture both ``salient'' and ``camouflaged'' objects. Extensive experiments demonstrate the competitive performance and effectiveness of our proposed method. The code is available at https://github.com/linuxsino/JoNet.
Chinese Summary: The SCJoint framework achieves joint learning of salient and camouflaged object detection by embedding task-specific parameters in a shared network, while its saliency-based sampling strategy improves training efficiency and performance.
English Summary: The SCJoint framework enables joint learning of salient and camouflaged object detection through task-specific parameters in a shared network, with a sampling strategy that enhances training efficiency and performance.
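One way to picture "minimal task-specific parameters in a fully shared network" is a shared normalization layer whose scale and shift are indexed by task, as in this hypothetical sketch; the paper's exact placement and parameterization may differ.

```python
import torch
import torch.nn as nn

class TaskSpecificNorm(nn.Module):
    """Sketch of decoupling two tasks (SOD vs. COD) inside a shared backbone:
    heavy weights are shared, while each task owns only a tiny scale/shift,
    echoing the idea of task-specific means and variances at minimal cost."""
    def __init__(self, channels, num_tasks=2):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels, affine=False)  # shared, no affine
        self.scale = nn.Parameter(torch.ones(num_tasks, channels))
        self.shift = nn.Parameter(torch.zeros(num_tasks, channels))

    def forward(self, x, task_id):
        x = self.norm(x)
        s = self.scale[task_id].view(1, -1, 1, 1)
        b = self.shift[task_id].view(1, -1, 1, 1)
        return x * s + b

feat = torch.randn(2, 32, 16, 16)
layer = TaskSpecificNorm(32)
sod_out = layer(feat, task_id=0)  # salient-object branch
cod_out = layer(feat, task_id=1)  # camouflaged-object branch
```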

Authors:Wonjung Park, Suhyun Ahn, Jinah Park
Title: LV-Net: Anatomy-aware lateral ventricle shape modeling with a case study on Alzheimer's disease
Abstract:
Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets. Finally, we apply LV-Net to Alzheimer's disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The code for LV shape modeling is available at https://github.com/PWonjung/LV_Shape_Modeling.
Chinese: LV-Net is a novel framework that generates individualized 3D lateral ventricle meshes from brain MRI by deforming a joint template, improving segmentation robustness and shape correspondence to strengthen neurological disease analysis, as validated in an Alzheimer's disease study.
English: LV-Net is a novel framework that generates individualized 3D lateral ventricle meshes from brain MRI by deforming a joint template, improving segmentation robustness and shape correspondence for enhanced neurological disease analysis, as demonstrated in Alzheimer's disease research.

Authors:Jun Xie, Yingjian Zhu, Feng Chen, Zhenghao Zhang, Xiaohui Fan, Hongzhu Yi, Xinming Wang, Chen Yu, Yue Bi, Zhaoran Zhao, Xiongjun Guan, Zhepeng Wang
Title: More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment
Abstract:
In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
Chinese: This paper proposes a semi-supervised emotion recognition framework that fuses diverse input modalities with a consensus-based pseudo-labeling strategy, placing second in the MER2025-SEMI challenge with an F1-score of 0.8772.
English: This paper introduces a semi-supervised framework for emotion recognition that combines diverse input modalities and consensus-based pseudo-labeling, achieving second place in the MER2025-SEMI challenge with an F1-score of 0.8772.

Authors:Utku Ozbulak, Michaela Cohrs, Hristo L. Svilenov, Joris Vankerschaver, Wesley De Neve
Title: Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis
Abstract:
Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with no notable downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp-generative-ai.
Chinese: This study develops a state-of-the-art diffusion model that generates high-quality particle images to address data scarcity and class imbalance in sub-visible particle analysis, improving the classification performance of multi-class deep neural networks with no notable downside.
English: This study introduces a state-of-the-art diffusion model to generate high-fidelity particle images, effectively addressing data scarcity and imbalance in training multi-class deep neural networks for sub-visible particle analysis, thereby improving classification performance without significant drawbacks.

Authors:Jun Feng, Zixin Wang, Zhentao Zhang, Yue Guo, Zhihan Zhou, Xiuyi Chen, Zhenyang Li, Dawei Yin
Title: MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as an image that contains both the question text and its visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
Chinese: The MathReal dataset fills a gap in multimodal large language model evaluation by providing real K-12 math questions photographed with mobile devices, revealing the substantial challenges these models face when solving problems in real-world scenarios.
English: The MathReal dataset addresses the gap in evaluating multimodal large language models (MLLMs) by providing real-world K-12 math questions captured on mobile devices, revealing significant challenges in their problem-solving abilities under realistic conditions.

Authors:Kai Zhang, Peng Wang, Sai Bi, Jianming Zhang, Yuanjun Xiong
Title: KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training
Abstract:
We present KnapFormer, an efficient and versatile framework to combine workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence length metadata across all ranks in a balancing group and solving a global knapsack problem. The solver aims to minimize the variance of the total per-GPU workload, while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism in the load-balancing decision process and utilizing a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy in real-world training workloads with sequence length varying from a few hundred to tens of thousands. It eliminates straggler effects and achieves 2x to 3x speedup when training state-of-the-art diffusion models like FLUX on mixed-resolution and image-video joint data corpora. We open-source the KnapFormer implementation at https://github.com/Kai-46/KnapFormer/
Chinese: KnapFormer is an efficient framework that combines load balancing with sequence parallelism in distributed Diffusion Transformer training, redistributing tokens to minimize workload variance and eliminate straggler effects, achieving a 2-3x training speedup.
English: KnapFormer is an efficient framework that combines workload balancing and sequence parallelism in distributed training of Diffusion Transformers, achieving 2-3x speedup by redistributing tokens to minimize workload variance and eliminate stragglers.
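As a toy stand-in for the global knapsack solver, the following greedy balancer assigns sequences (longest first) to the least-loaded rank using a quadratic attention-cost proxy. The cost model, the greedy heuristic, and the omission of communication terms are all simplifying assumptions relative to the paper's solver.

```python
import heapq

def balance_sequences(seq_lens, num_gpus, sp_degree=1):
    """Greedy sketch of global balancing: longest-first assignment to the
    least-loaded rank, with cost ~ length^2 / sequence-parallel degree."""
    heap = [(0.0, g, []) for g in range(num_gpus)]
    heapq.heapify(heap)
    for length in sorted(seq_lens, reverse=True):
        load, g, items = heapq.heappop(heap)       # least-loaded rank
        cost = (length ** 2) / sp_degree           # crude attention-cost proxy
        heapq.heappush(heap, (load + cost, g, items + [length]))
    return sorted(heap, key=lambda t: t[1])

for load, gpu, seqs in balance_sequences([300, 12000, 800, 4500, 4400, 700], num_gpus=2):
    print(f"gpu {gpu}: load ~{load:.0f}, seqs={seqs}")
```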

Authors:Younjoon Chung, Hyoungseob Park, Patrick Rim, Xiaoran Zhang, Jihe He, Ziyao Zeng, Safa Cicek, Byung-Woo Hong, James S. Duncan, Alex Wong
Title: ETA: Energy-based Test-time Adaptation for Depth Completion
Abstract:
We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some ``source'' data, often predict erroneous outputs when transferred to ``target'' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method ``Energy-based Test-time Adaptation'', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% outdoors and 10.23% indoors. Project Page: https://fuzzythecat.github.io/eta.
Chinese: We propose Energy-based Test-time Adaptation (ETA), which uses adversarial perturbations to train an energy model that scores depth predictions, then updates pretrained model parameters at test time to match the source distribution, significantly surpassing prior state-of-the-art methods on indoor and outdoor datasets.
English: We introduce Energy-based Test-time Adaptation (ETA), a method that adjusts pretrained depth completion models during testing by using adversarial perturbations to train an energy model, which scores predictions and updates model parameters to align with the source data distribution, achieving significant improvements over prior methods.
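The core loop can be sketched as a single hypothetical test-time step: score the current prediction with a frozen energy model and descend that energy. `depth_model`, `energy_model`, and the optimizer are stand-ins; the adversarial-perturbation training of the energy model happens beforehand and is not shown.

```python
import torch

def eta_step(depth_model, energy_model, optimizer, image, sparse_depth):
    """One sketch of a test-time adaptation step: low energy means the
    prediction looks in-distribution, so we update the depth network
    (not the frozen energy model) to reduce the energy of its output."""
    pred = depth_model(image, sparse_depth)     # dense depth prediction
    energy = energy_model(pred).mean()          # frozen in/out-of-distribution scorer
    optimizer.zero_grad()
    energy.backward()
    optimizer.step()
    return energy.item()
```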

Authors:Lang Nie, Yuan Mei, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao
Title: Robust Image Stitching with Optimal Plane
Abstract:
We present \textit{RopStitch}, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of \textit{RopStitch}, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation often contradict each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into \textit{RopStitch} by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that \textit{RopStitch} significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at https://github.com/MmelodYy/RopStitch.
Chinese: RopStitch is an unsupervised deep image stitching framework that adopts a dual-branch architecture to integrate content perception with fine-grained features and resolves the conflict between alignment and structure preservation via virtual optimal planes, achieving excellent robustness and naturalness across diverse scenes.
English: RopStitch is an unsupervised deep image stitching framework that uses a dual-branch architecture to integrate content perception and fine-grained features, along with virtual optimal planes to resolve alignment-structure conflicts, achieving superior robustness and naturalness across diverse scenes.

Authors:Hamidreza Dastmalchi, Aijun An, Ali Cheraghian
Title: ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates
Abstract:
Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses state-of-the-art TTA models in both accuracy and computational efficiency, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
Chinese Summary: The proposed efficient test-time adaptation method refines the decision boundary by dynamically integrating all test samples and adaptively fusing complementary scoring modules, surpassing state-of-the-art test-time adaptation models in both computational complexity and accuracy.
English Summary: The proposed Efficient Test-Time Adaptation (ETTA) method overcomes limitations of existing cache-based approaches by dynamically integrating all test samples to refine decision boundaries and adaptively combining complementary scoring modules, achieving superior accuracy and efficiency compared to state-of-the-art models.
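A minimal sketch of the recursive-updating idea, assuming CLIP-style L2-normalized features: one running embedding per class is refined by every incoming test sample, weighted by pseudo-label confidence, giving the effect of an unbounded cache in O(1) memory. The temperature and update rule are illustrative, not ETTA's exact formulation.

```python
import torch
import torch.nn.functional as F

class RecursiveClassEmbedding:
    """One running embedding per class, folded over all incoming test features."""
    def __init__(self, init_text_embeds):            # (num_classes, dim), normalized
        self.embeds = init_text_embeds.clone()
        self.counts = torch.ones(init_text_embeds.size(0))

    def update(self, feat):                          # feat: (dim,), normalized
        probs = F.softmax(self.embeds @ feat * 100.0, dim=0)   # temperature ~ CLIP
        conf, cls = probs.max(dim=0)
        k, w = cls.item(), conf.item()
        self.counts[k] += w
        # Confidence-weighted running mean, renormalized to the unit sphere.
        self.embeds[k] = self.embeds[k] + (w / self.counts[k]) * (feat - self.embeds[k])
        self.embeds[k] = F.normalize(self.embeds[k], dim=0)
        return k, w

text = F.normalize(torch.randn(10, 512), dim=-1)     # 10 toy class embeddings
cache = RecursiveClassEmbedding(text)
cls, conf = cache.update(F.normalize(torch.randn(512), dim=0))
```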

Authors:Guoping Xu, Hua-Chieh Shao, You Zhang
Title: TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios
Abstract:
Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
Chinese: The TSMS-SAM2 framework tackles rapid object motion and memory redundancy in surgical videos through multi-temporal-scale sampling and memory optimization strategies, achieving state-of-the-art segmentation performance on benchmark datasets.
English: The TSMS-SAM2 framework enhances promptable video object segmentation and tracking in surgical videos by addressing motion dynamics and memory redundancy through multi-temporal-scale sampling and memory optimization, achieving superior performance on benchmark datasets.
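The multi-temporal-scale sampling can be pictured as drawing training clips at randomly chosen frame strides, so the model sees both slow and rapid instrument motion. The stride set and clip length in this helper are assumptions, and it presumes the video is long enough for the largest stride.

```python
import random

def multi_scale_temporal_sample(num_frames, clip_len=8, strides=(1, 2, 4)):
    """Sketch: pick a stride, then a random start, and return clip frame indices.
    Assumes num_frames > clip_len * max(strides)."""
    stride = random.choice(strides)
    span = (clip_len - 1) * stride
    start = random.randint(0, max(0, num_frames - span - 1))
    return [start + i * stride for i in range(clip_len)]

print(multi_scale_temporal_sample(num_frames=300))
```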

Authors:Raphael Du Sablon, David Hart
Title: Optimization-Free Style Transfer for 3D Gaussian Splats
Abstract:
The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving speeds under 2 minutes even on consumer-grade hardware. We demonstrate the quality of the results this approach achieves and compare them to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
Chinese: This paper presents an optimization-free method for rapidly stylizing 3D Gaussian splats by building a graph structure over the implicit surface and applying feed-forward stylization, producing results in under two minutes on consumer-grade hardware without additional training.
English: This paper introduces a fast, optimization-free method for stylizing 3D Gaussian splats by generating a graph structure on their implicit surface and applying a feed-forward stylization technique, enabling rapid results under two minutes on consumer hardware without additional training.

Authors:Seyed Hadi Seyed, Ayberk Cansever, David Hart
Title: Improving Masked Style Transfer using Blended Partial Convolution
Abstract:
Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply artistic style transfer to the whole image, but individual users may only need to apply it to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at https://github.com/davidmhart/StyleTransferMasked.
Chinese: This paper proposes a partial-convolution-based style transfer network that applies artistic styles precisely to selected image regions, resolving the inaccurate feature capture of the conventional mask-after-stylization approach.
English: This paper introduces a partial-convolution-based style transfer network that precisely applies artistic styles to selected image regions, overcoming the limitations of traditional methods that inaccurately capture features when masking after stylization.
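For reference, here is a minimal partial convolution in the style of Liu et al., which the approach above builds on: it convolves only masked-in pixels and renormalizes by local mask coverage, so features never leak in from outside the region of interest. The paper's network-internal blending stages are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolve only valid (masked-in) pixels and renormalize by coverage."""
    def __init__(self, cin, cout, k=3, pad=1):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, k, padding=pad, bias=False)
        self.register_buffer("ones", torch.ones(1, cin, k, k))
        self.pad = pad

    def forward(self, x, mask):                         # mask: (B, 1, H, W) in {0, 1}
        m = mask.expand(-1, x.size(1), -1, -1)
        cover = F.conv2d(m, self.ones, padding=self.pad)  # valid inputs per window
        out = self.conv(x * m)                            # zero out invalid pixels
        out = out * (self.ones.numel() / cover.clamp(min=1.0))  # renormalize
        new_mask = (cover > 0).float()                    # propagate validity
        return out, new_mask

pconv = PartialConv2d(3, 16)
x = torch.randn(1, 3, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
y, m = pconv(x, mask)
```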

Authors:Jinjia Peng, Zeze Tao, Huibing Wang, Meng Wang, Yang Wang
Title: Boosting Adversarial Transferability via Residual Perturbation Attack
Abstract:
Deep neural networks are susceptible to adversarial examples, producing incorrect predictions under imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability, alleviating overfitting to surrogate models. However, prior works overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. The experimental results demonstrate the better transferability of ResPA than the existing typical transfer-based attack methods, while the transferability can be further improved by combining ResPA with the current input transformation methods. The code is available at https://github.com/ZezeTao/ResPA.
Chinese: The proposed Residual Perturbation Attack (ResPA) uses residual gradients to steer perturbations toward flat regions of the loss landscape, markedly improving adversarial transferability over existing methods, with further gains when combined with input transformation techniques.
English: The proposed Residual Perturbation Attack (ResPA) enhances adversarial transferability by using residual gradients to guide perturbations toward flat loss landscapes, outperforming existing methods and further improving when combined with input transformations.
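An illustrative sketch of the residual-gradient step, assuming an L-infinity budget: keep an EMA of input gradients as the reference and step along the residual between the current gradient and that reference. The exact weighting and normalization in the paper differ; this only shows the general shape of the iteration.

```python
import torch

def respa_style_attack(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10, beta=0.9):
    """Sketch: EMA reference gradient + residual direction, with eps-ball projection."""
    x_adv = x.clone().detach()
    ref = torch.zeros_like(x)                         # EMA reference gradient
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        residual = grad - ref                         # change vs. historical direction
        ref = beta * ref + (1 - beta) * grad          # update first moment
        x_adv = x_adv.detach() + alpha * residual.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project into the eps-ball
        x_adv = torch.clamp(x_adv, 0, 1)
    return x_adv.detach()
```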

Authors:Jin Khye Tan, En Jun Choong, Ethan Jeremiah Chitty, Yan Pheng Choo, John Hsin Yang Wong, Chern Eu Cheah
Title: Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports
Abstract:
Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI's GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.
Chinese: This study develops a fine-tuned vision-language model based on Qwen2.5-VL-7B that exceeds 92% accuracy in converting complex Malaysian financial tables to Markdown, outperforming proprietary and larger-scale models while substantially reducing computational cost.
English: This study introduces a fine-tuned vision-language model based on Qwen2.5-VL-7B that achieves over 92% accuracy in converting complex Malaysian financial tables to Markdown format, outperforming both proprietary and larger models while reducing computational costs.

Authors:Mohammed Talha Alam, Fahad Shamshad, Fakhri Karray, Karthik Nandakumar
Title: FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing
Abstract:
Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving an over 11% gain on a commercial API compared to recent cancelable biometric methods. Code is available at: https://github.com/talha-alam/faceanonymixer.
Chinese Summary: FaceAnonyMixer is a cancelable face generation framework that irreversibly mixes the latent code of a real face with a synthetic code derived from a revocable key, maintaining high recognition accuracy while offering stronger privacy protection and outperforming existing cancelable biometric methods.
English Summary: FaceAnonyMixer is a cancelable face generation framework that synthesizes privacy-preserving face images by irreversibly mixing real face latent codes with synthetic codes from revocable keys, achieving superior recognition accuracy and stronger privacy protection compared to existing methods.
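The key-driven mixing can be sketched as below, where the synthetic code is derived deterministically from a revocable integer key; changing the key re-issues a fresh template. The mixing coefficient and normalization are illustrative, and the paper's multi-objective refinement of the mixed code is omitted.

```python
import torch
import torch.nn.functional as F

def mix_latents(z_real, key, alpha=0.6):
    """Sketch: blend a real face latent with a key-derived synthetic latent.
    alpha and the linear mixing rule are assumptions for illustration."""
    g = torch.Generator().manual_seed(key)            # revocable key -> synthetic code
    z_key = torch.randn(z_real.shape, generator=g)
    return F.normalize(alpha * z_real + (1 - alpha) * z_key, dim=-1)

z = torch.randn(1, 512)                               # toy face latent
template_a = mix_latents(z, key=1234)
template_b = mix_latents(z, key=5678)                 # revoke 1234, re-issue a new template
```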

Authors:Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury, Jiachen Li
Title: Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling
Abstract:
Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent's behavior through constrained reinforcement learning. The system helps regulate the agent's actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on https://gen-safe-nav.github.io/.
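The adaptive conformal inference machinery above can be sketched as a simple online quantile tracker over pedestrian prediction errors: after a miss the coverage level loosens and the uncertainty radius grows, and after a hit it tightens. The step size and target miscoverage here are hypothetical values for illustration.

```python
import numpy as np

def aci_update(alpha, miss, target_alpha=0.1, step=0.05):
    # Adaptive conformal inference: shrink alpha after a miss (wider region next),
    # grow it after a hit (tighter region), tracking the target miscoverage rate.
    return min(max(alpha + step * (target_alpha - miss), 1e-3), 0.999)

errors, alpha, radius = [], 0.1, 0.5
for err in [0.2, 0.6, 0.3, 0.9, 0.1]:        # pedestrian prediction errors (meters)
    miss = float(err > radius)               # 1.0 if the region failed to cover
    alpha = aci_update(alpha, miss)
    errors.append(err)
    radius = float(np.quantile(errors, 1 - alpha))   # new uncertainty radius
print(radius)                                # fed to the agent as an observation
```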

Authors:Weiqi Zhang, Junsheng Zhou, Haotian Geng, Wenyuan Zhang, Yu-Shen Liu
Title: GAP: Gaussianize Any Point Clouds with Text Guidance
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert the colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffusion-based inpainting strategy that specifically targets completing hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.
Chinese: This paper proposes GAP, which gaussianizes raw point clouds into high-fidelity 3D Gaussians under text guidance, ensuring geometric accuracy and multi-view appearance consistency during optimization.
English: This paper introduces GAP, a novel method that transforms raw point clouds into high-fidelity 3D Gaussians using text guidance, ensuring geometric accuracy and appearance consistency across viewpoints.

Authors:Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai
Title: MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Abstract:
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops on MOSEv2. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and observe similar declines, demonstrating that MOSEv2 poses challenges across tasks. These results highlight that despite strong performance on existing datasets, current VOS methods still fall short under real-world complexities. Based on our analysis of the observed challenges, we further propose several practical tricks that enhance model performance. MOSEv2 is publicly available at https://MOSE.video.
Chinese: The MOSEv2 dataset introduces more complex real-world scene challenges that substantially degrade the performance of existing video object segmentation methods, exposing the shortcomings of current techniques in practical applications.
English: The MOSEv2 dataset introduces greater complexity to video object segmentation with challenging real-world scenarios, causing significant performance drops in current methods and highlighting their limitations.

Authors:Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
Title: Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Abstract:
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
Chinese Summary: This work proposes GUI-RC and GUI-RCPO, which exploit the spatial consistency of multiple predictions to improve GUI grounding accuracy, delivering gains of up to 5% without additional training and further improvements through self-supervised optimization.
English Summary: The study introduces GUI-RC and GUI-RCPO, two methods that enhance GUI grounding accuracy by leveraging spatial consensus from multiple predictions, achieving up to 5% improvement without additional training or through self-supervised optimization.
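A minimal sketch of the spatial voting grid described above: rasterize the sampled predictions onto a coarse grid and return the center of the most-voted cell as the consensus point. The grid resolution and screen size here are arbitrary illustration values.

```python
import numpy as np

def region_consistency(boxes, grid=(100, 100), screen=(1920, 1080)):
    """Each sampled box votes for the grid cells it covers; the winning cell's
    center is the consensus click point."""
    gh, gw = grid
    sw, sh = screen
    votes = np.zeros((gh, gw), dtype=int)
    for x1, y1, x2, y2 in boxes:
        cx1 = min(int(x1 / sw * gw), gw - 1); cy1 = min(int(y1 / sh * gh), gh - 1)
        cx2 = min(int(x2 / sw * gw), gw - 1); cy2 = min(int(y2 / sh * gh), gh - 1)
        votes[cy1:cy2 + 1, cx1:cx2 + 1] += 1
    cy, cx = np.unravel_index(votes.argmax(), votes.shape)
    # Map the winning cell back to screen coordinates (cell center).
    return ((cx + 0.5) / gw * sw, (cy + 0.5) / gh * sh)

boxes = [(100, 200, 160, 240), (105, 205, 162, 238), (90, 195, 150, 235)]
print(region_consistency(boxes))   # consensus click point
```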

Authors:Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, Ziwei Liu
Title: Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
Abstract:
Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.
Chinese: The Hi3DEval framework adopts a hierarchical approach that combines object-level and part-level evaluation to analyze 3D content across multiple dimensions, and by better modeling spatial coherence and material realism it aligns with human preferences far more closely than existing image-based metrics.
English: The Hi3DEval framework introduces a hierarchical approach combining object-level and part-level evaluations to assess 3D content across multiple dimensions, outperforming existing image-based metrics in capturing spatial coherence and material authenticity while aligning better with human preferences.

Authors:Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Title: Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Abstract:
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity of modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. Experimental results on the reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
Chinese Summary: Uni-CoT achieves coherent multimodal reasoning within a single unified model through a novel two-level reasoning paradigm and structured training, attaining state-of-the-art performance on vision-language tasks while markedly reducing computational cost.
English Summary: Uni-CoT is a unified Chain-of-Thought framework that enables coherent multimodal reasoning through a two-level reasoning paradigm and structured training, achieving state-of-the-art performance on vision-language tasks with efficient computational requirements.

Authors:Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Title: WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Abstract:
Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19) with a 400% compression ratio. Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384), which has only 50% of our compression rate. Code and models are available: https://github.com/zhuangshaobin/WeTok.
Chinese: The WeTok tokenizer combines group-wise lookup-free quantization with generative decoding to deliver outstanding reconstruction fidelity at high compression ratios, setting new performance records on mainstream benchmarks.
English: The WeTok tokenizer introduces group-wise lookup-free quantization and generative decoding to achieve superior reconstruction fidelity and compression ratios, setting new performance records on benchmarks.
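Group-wise lookup-free quantization can be sketched as sign-binarizing each channel group with a straight-through estimator, so no codebook lookup is needed and the implicit codebook scales as 2^(channels per group). The paper's entropy objectives and generative decoder are omitted; this only illustrates the quantizer shape.

```python
import torch

def groupwise_lfq(z, num_groups):
    """Sketch: split channels into groups and binarize with sign(); the
    straight-through estimator keeps the operation trainable."""
    B, C, H, W = z.shape
    assert C % num_groups == 0
    zg = z.view(B, num_groups, C // num_groups, H, W)
    q = torch.where(zg >= 0, torch.ones_like(zg), -torch.ones_like(zg))
    q = zg + (q - zg).detach()          # straight-through gradient
    return q.view(B, C, H, W)

codes = groupwise_lfq(torch.randn(1, 16, 8, 8), num_groups=4)
```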

Authors:Hao Dong, Lijun Sheng, Jian Liang, Ran He, Eleni Chatzi, Olga Fink
Title: Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities across a wide range of tasks. However, their performance often remains suboptimal when directly applied to specific downstream scenarios without task-specific adaptation. To enhance their utility while preserving data efficiency, recent research has increasingly focused on unsupervised adaptation methods that do not rely on labeled data. Despite the growing interest in this area, there remains a lack of a unified, task-oriented survey dedicated to unsupervised VLM adaptation. To bridge this gap, we present a comprehensive and structured overview of the field. We propose a taxonomy based on the availability and nature of unlabeled visual data, categorizing existing approaches into four key paradigms: Data-Free Transfer (no data), Unsupervised Domain Transfer (abundant data), Episodic Test-Time Adaptation (batch data), and Online Test-Time Adaptation (streaming data). Within this framework, we analyze core methodologies and adaptation strategies associated with each paradigm, aiming to establish a systematic understanding of the field. Additionally, we review representative benchmarks across diverse applications and highlight open challenges and promising directions for future research. An actively maintained repository of relevant literature is available at https://github.com/tim-learn/Awesome-LabelFree-VLMs.
Chinese: This survey systematically reviews unsupervised adaptation methods for vision-language models, categorizing them into four paradigms according to the availability of unlabeled data and analyzing each class of methods for improving performance on specific downstream tasks.
English: This survey provides a structured overview of unsupervised adaptation methods for Vision-Language Models, categorizing them into four paradigms based on unlabeled data availability and analyzing their methodologies to address performance gaps in downstream tasks.

Authors:Shaowu Chen, Wei Ma, Binhua Huang, Qingyuan Wang, Guoxin Wang, Weize Sun, Lei Huang, Deepu John
Title: Optimal Brain Connection: Towards Efficient Structural Pruning
Abstract:
Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections--including pruned ones--during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection
Chinese: This paper proposes a structural pruning framework named Optimal Brain Connection, whose innovations include the Jacobian Criterion, which evaluates parameter saliency by capturing intra-component interactions and inter-layer dependencies, and an Equivalent Pruning mechanism that uses autoencoders to preserve the contributions of original connections during fine-tuning; experiments show both effectively maintain model performance.
English: This paper introduces Optimal Brain Connection, a structural pruning framework featuring the Jacobian Criterion that evaluates parameter saliency by capturing intra- and inter-layer dependencies, along with an Equivalent Pruning mechanism using autoencoders to maintain original connection contributions during fine-tuning, both proven effective in preserving model performance.
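For contrast with the Jacobian Criterion, here is the isolated first-order baseline it improves upon: per-channel |weight x gradient| saliency from a single backward pass. The dependency-aware terms the paper adds are not modeled here; this is only the starting point such criteria build on.

```python
import torch
import torch.nn as nn

def first_order_channel_saliency(model, loss_fn, x, y):
    """Baseline sketch: sum |w * dL/dw| over each Conv2d output channel.
    Lower scores mark better pruning candidates (isolated-parameter view)."""
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    scores = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d) and m.weight.grad is not None:
            # Aggregate first-order saliency per output channel.
            scores[name] = (m.weight * m.weight.grad).abs().sum(dim=(1, 2, 3)).detach()
    return scores

net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 4, 3, padding=1))
s = first_order_channel_saliency(net, nn.MSELoss(),
                                 torch.randn(2, 3, 16, 16), torch.randn(2, 4, 16, 16))
```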

Authors:Lin Zhu, Ruonan Liu, Xiao Wang, Lizhi Wang, Hua Huang
Title: Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events
Abstract:
The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone, preserving representations learned from masked modeling and stabilizing their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.
中文摘要:本文提出一种自监督预训练框架,通过三个阶段创新性地从稀疏事件数据中提取增强特征,在多种视觉任务中展现出卓越性能。
English Summary: This paper introduces a self-supervised pre-training framework that enhances feature extraction from sparse event data through three innovative stages, demonstrating superior performance across multiple vision tasks.

Authors:Weikang Wang, Tobias Weißberg, Nafie El Amrani, Florian Bernard
Title: Symmetry Understanding of 3D Shapes via Chirality Disentanglement
Abstract:
Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/

Authors:Lumin Chen, Zhiying Wu, Tianye Lei, Xuexue Bai, Ming Feng, Yuxi Wang, Gaofeng Meng, Zhen Lei, Hongbin Liu
Title: F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery
Abstract:
Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.
中文: 本研究提出了垂体解剖分割(PAS)数据集和F2PASeg模型,该模型通过特征融合模块增强了对关键解剖结构的实时分割能力,有效应对遮挡和类别不平衡等挑战,提升了垂体手术的安全性。
English: The study introduces the Pituitary Anatomy Segmentation (PAS) dataset and the F2PASeg model, which uses a Feature Fusion module to enhance real-time segmentation of critical anatomical structures in pituitary surgery, improving surgical safety despite challenges like occlusions and class imbalance.
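
A minimal sketch of the general idea of fusing high-resolution image features with deep semantic embeddings for segmentation. Channel sizes, the 1x1/3x3 design, and the class count are illustrative assumptions; this is not the exact F2PASeg Feature Fusion module.

```python
# Minimal feature-fusion sketch: upsample deep semantic features, combine them
# with high-resolution features, and predict a segmentation map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, high_ch=64, deep_ch=256, out_ch=64, num_classes=4):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, out_ch, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(high_ch + out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(out_ch, num_classes, kernel_size=1)

    def forward(self, high_res, deep_sem):
        deep = self.reduce(deep_sem)
        deep = F.interpolate(deep, size=high_res.shape[-2:],
                             mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([high_res, deep], dim=1))
        return self.head(fused)

# Usage: high-res features at 1/4 scale, semantic features at 1/16 scale.
m = FeatureFusion()
logits = m(torch.randn(1, 64, 128, 128), torch.randn(1, 256, 32, 32))
print(logits.shape)  # torch.Size([1, 4, 128, 128])
```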

Authors:Rui Yu, Xianghang Zhang, Runkai Zhao, Huaicheng Yan, Meng Wang
Title: DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
Abstract:
End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50\% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model. Code and model are publicly available at https://github.com/YuruiAI/DistillDrive
中文:DistillDrive是一种基于知识蒸馏的端到端自动驾驶模型,通过多样化实例模仿增强运动特征学习,在基准数据集上实现了碰撞率降低50%和闭环性能提升3个点的显著改进。
English: DistillDrive is an end-to-end autonomous driving model that enhances decision-making robustness through knowledge distillation from a teacher model, achieving a 50% reduction in collision rate and improved closed-loop performance on benchmark datasets.

Authors:Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho
Title: UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
Abstract:
Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
中文: 提出的UNCAGE方法通过对比注意力引导在去掩码过程中优先处理对象标记,无需额外训练即可提升组合式文本到图像生成的准确性。
English: The proposed UNCAGE method enhances compositional text-to-image generation by using contrastive attention guidance to prioritize object tokens during unmasking, improving fidelity without additional training.
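
A toy sketch of attention-guided unmasking order: each masked image token is scored by how much more strongly it attends to one object's text tokens than to the others, and the most object-specific tokens are unmasked first. The scoring rule, tensor shapes, and helper `unmask_order` are a simplified reading of the idea, not UNCAGE's exact formulation.

```python
# Toy sketch: choose which masked image tokens to unmask first using a
# contrastive attention score over per-object text-token groups.
import torch

def unmask_order(attn, obj_groups, num_to_unmask):
    """attn: [num_image_tokens, num_text_tokens] attention weights.
    obj_groups: list of text-token index lists, one per object in the prompt."""
    per_obj = torch.stack([attn[:, idx].mean(dim=1) for idx in obj_groups], dim=1)
    best, _ = per_obj.max(dim=1)                       # strongest object association
    rest = (per_obj.sum(dim=1) - best) / max(len(obj_groups) - 1, 1)
    contrast = best - rest                             # high = clearly one object's token
    return contrast.argsort(descending=True)[:num_to_unmask]

attn = torch.rand(16, 6)                               # 16 image tokens, 6 text tokens
order = unmask_order(attn, obj_groups=[[0, 1, 2], [3, 4, 5]], num_to_unmask=4)
print(order)                                           # indices to unmask first
```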

Authors:Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold
Title: CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation
Abstract:
As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial improvement of absolute 7.9\% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.
Chinese: CT-GRAPH提出了一种分层图注意力网络,通过建模解剖区域关系来改进放射学报告生成,相比现有方法在F1分数上实现了7.9%的绝对提升。
English: CT-GRAPH introduces a hierarchical graph attention network that models anatomical relationships to enhance radiology report generation, achieving a 7.9% F1 score improvement over existing methods.

Authors:Yongjun Zhang, Mingtao Xiong, Yi Wan, Gui-Song Xia
Title: Cross-View Localization via Redundant Sliced Observations and A-Contrario Validation
Abstract:
Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3\%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from $\mathbf{3.42^{\circ}}$ to $\mathbf{1.24^{\circ}}$, outperforming state-of-the-art methods. Code and dataset will be available at: https://github.com/bnothing/Slice-Loc.
Chinese: Slice-Loc通过将地面图像分割为子图进行冗余位姿估计,并利用几何刚性和反偶然验证来提高跨视角定位的精度和可靠性,在GNSS缺失环境中显著降低了定位误差。
English: Slice-Loc enhances cross-view localization by dividing ground images into slices for redundant pose estimation and using geometric rigidity with a-contrario validation to improve accuracy and reliability, significantly reducing errors in GNSS-denied environments.
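
A generic sketch of what merging redundant 3-DoF observations can look like: gate outliers with a median/MAD rule, then average the inliers. This is a simple robust-statistics stand-in for illustration only; it is not Slice-Loc's geometric rigidity formula or its a-contrario NFA test.

```python
# Generic robust merging of redundant (x, y, yaw) observations from slices.
import numpy as np

def merge_poses(poses, k=3.0):
    """poses: [N, 3] array of (x, y, yaw_deg) estimates, one per slice."""
    poses = np.asarray(poses, dtype=float)
    med = np.median(poses, axis=0)
    mad = np.median(np.abs(poses - med), axis=0) + 1e-6
    # 1.4826 * MAD approximates a standard deviation under Gaussian noise.
    inlier = np.all(np.abs(poses - med) <= k * 1.4826 * mad, axis=1)
    # Yaw is averaged linearly here; a circular mean would be needed near +/-180 deg.
    return poses[inlier].mean(axis=0), inlier

obs = np.array([[10.1, 4.9, 31.0], [10.3, 5.2, 29.5],
                [10.0, 5.0, 30.2], [42.0, -7.0, 95.0]])  # last row is a gross error
pose, inlier = merge_poses(obs)
print(pose, inlier)   # averaged inlier pose; the outlier is flagged False
```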

Authors:Yue Duan, Taicai Chen, Lei Qi, Yinghuan Shi
Title: Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning
Abstract:
Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP's outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 5.94% in the last accuracy, validating its effectiveness. The code is available at https://github.com/NJUyued/USP4SSCL.
Chinese: USP框架通过特征空间预留增强学习可塑性、分治伪标记处理未标记数据以及类均值锚定蒸馏保障记忆稳定性,在持续学习中实现了比现有方法高达5.94%的精度提升。
English: The USP framework enhances semi-supervised continual learning by integrating feature space reservation for plasticity, divide-and-conquer pseudo-labeling for unlabeled learning, and class-mean-anchored distillation for memory stability, achieving up to 5.94% higher accuracy than previous methods.
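
A sketch of constructing a simplex equiangular tight frame (ETF) of class prototypes, the geometric structure that "feature space reservation" style strategies use to pre-allocate directions for future classes. How USP assigns old and reserved classes to these directions is not reproduced; the dimensions and seed are illustrative.

```python
# Construct a simplex ETF: K unit-norm prototypes in d dims with pairwise
# cosine similarity -1/(K-1), i.e. maximally and equally separated.
import numpy as np

def simplex_etf(num_classes: int, dim: int) -> np.ndarray:
    """Return [num_classes, dim] prototypes with pairwise cosine -1/(K-1)."""
    assert dim >= num_classes, "need dim >= num_classes for orthonormal columns"
    rng = np.random.default_rng(0)
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))  # orthonormal columns
    k = num_classes
    etf = np.sqrt(k / (k - 1)) * u @ (np.eye(k) - np.ones((k, k)) / k)
    return etf.T   # one unit-norm prototype per row

proto = simplex_etf(num_classes=10, dim=64)
cos = proto @ proto.T
print(np.round(cos[0, :3], 3))   # ~[1.0, -0.111, -0.111]
```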

Authors:Meiqi Wu, Yaxuan Kang, Xuchen Li, Shiyu Hu, Xiaotang Chen, Yunfeng Kang, Weiqiang Wang, Kaiqi Huang
Title: VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
Abstract:
The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, through sketches with the theme of "a person picking an apple from a tree (PPAT)", it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we make the following efforts: (1) providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to research on mental state assessment based on the recognition of PPAT sketch elements. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.
Chinese: 该研究提出了一种基于视觉语义分析和大型语言模型的自动化方法,用于通过PPAT绘画评估抑郁状态,相比心理学家评估方法准确率提升了17.6%。
English: The study introduces an automated method using Visual-Semantic analysis with LLMs to efficiently assess depression through PPAT sketches, improving accuracy by 17.6% over traditional psychologist evaluations.

Authors:Xiaoyang Zhang, Guodong Fan, Guang-Yong Chen, Zhen Hua, Jinjiang Li, Min Gan, C. L. Philip Chen
Title: Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection
Abstract:
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply the Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
中文: 本研究提出了一种名为小波引导双频编码(WGDF)的方法,通过小波变换进行频域分析,增强边缘细节表征和全局结构建模,从而提升遥感图像中细微变化区域的检测能力,相比现有方法实现了更优的准确性和鲁棒性。
English: This study introduces Wavelet-Guided Dual-Frequency Encoding (WGDF), a method that leverages frequency-domain analysis through wavelet transforms to enhance the detection of subtle changes in remote sensing imagery by improving edge detail representation and global structural modeling, achieving superior accuracy and robustness over existing approaches.
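
A sketch of the decomposition step only: a one-level 2D discrete wavelet transform with PyWavelets splits an image into a low-frequency approximation (global structure) and three high-frequency detail sub-bands (edges). The DFFE/FDID/PCDM modules themselves are not reproduced, and the random input stands in for a real image pair.

```python
# One-level 2D DWT producing the low/high-frequency inputs of a dual-branch design.
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)    # stand-in grayscale image
cA, (cH, cV, cD) = pywt.dwt2(img, wavelet="haar")

low_freq = cA                       # 128x128 approximation -> global-structure branch
high_freq = np.stack([cH, cV, cD])  # 3x128x128 details -> edge-enhancement branch
print(low_freq.shape, high_freq.shape)

# The decomposition itself is perfectly invertible.
recon = pywt.idwt2((cA, (cH, cV, cD)), wavelet="haar")
print(np.allclose(recon, img, atol=1e-5))
```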

Authors:Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim
Title: B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding
Abstract:
Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/

Authors:Xiaoyang Zhang, Jinjiang Li, Guodong Fan, Yakun Ju, Linwei Fan, Jun Liu, Alex C. Kot
Title: SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion
Abstract:
Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.
中文:本文提出SGDFuse,一种由Segment Anything Model(SAM)引导的条件扩散模型,利用高质量语义掩码实现高保真和语义感知的红外与可见光图像融合,在主观和客观评估中均优于现有方法。
English: This paper introduces SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), which leverages high-quality semantic masks to achieve high-fidelity and semantically-aware infrared and visible image fusion, outperforming existing methods in both subjective and objective evaluations.

Authors:Hyunjoon Lee, Joonkyu Min, Jaesik Park
Title: CF3: Compact and Fast 3D Feature Fields
Abstract:
3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.

Authors:Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang
Title: ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
Abstract:
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse fixed language descriptions with vision features or simply modify them using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking; however, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning (GRPO) are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released on https://github.com/Event-AHU/Open_VLTrack
中文: 本文提出了一种基于推理的视觉语言跟踪框架ReasoningTrack,通过预训练视觉语言模型融合动态语言描述与视觉特征,有效提升目标跟踪性能,并在多个基准数据集上验证了其优越性。
English: This paper introduces ReasoningTrack, a novel reasoning-based vision-language tracking framework that leverages a pre-trained vision-language model and integrates updated language descriptions with visual features to enhance target tracking accuracy, validated through extensive experiments on multiple benchmarks.

Authors:Jianming Liu, Wenlong Qiu, Haitao Wei
Title: Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
Abstract:
Few-Shot Segmentation (FSS) aims at efficient segmentation of new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) is proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18\% and 4.11\%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.
中文: 本研究提出一种无源跨域小样本分割方法,通过融合文本与视觉信息提升目标域适应能力,在不使用源域数据的情况下显著提高了分割精度。
English: This study introduces a source-free cross-domain few-shot segmentation method that integrates textual and visual cues to enhance target domain adaptation, achieving notable accuracy improvements without using source domain data.

Authors:Bingyu Yang, Qingyao Tian, Yimeng Geng, Huai Liao, Xinyan Huang, Jiebo Luo, Hongbin Liu
Title: EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery
Abstract:
Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.
中文: EndoMatcher是一种通用的内窥镜图像匹配器,通过大规模多领域预训练和渐进式训练策略,在挑战性内窥镜条件下实现了鲁棒的特征匹配,在零样本实验中显著优于现有最优方法。
English: EndoMatcher is a generalizable endoscopic image matcher that leverages large-scale multi-domain pre-training and a progressive training strategy to achieve robust feature matching across challenging endoscopic conditions, significantly outperforming state-of-the-art methods in zero-shot experiments.

Authors:Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao, Jing Zhang, Minqiang Xu, Jianbo Zhan, Jianshe Wang, Lin Liu, Bo Du, Liangpei Zhang
Title: SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images
Abstract:
Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: https://github.com/MiliLab/SPEX.
中文摘要:通过SPIE数据集编码光谱先验知识开发的SPEX多模态视觉语言模型,在多光谱遥感影像地物提取中显著优于现有方法,并能提供可解释的预测结果。
English Summary: The SPEX multimodal vision-language model, developed using the SPIE dataset with encoded spectral priors, significantly outperforms existing methods in land cover extraction from multispectral imagery while providing interpretable predictions.
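
A sketch of the classical spectral-index computation that underlies encoding spectral priors as text: compute indices such as NDVI and NDWI from multispectral bands and render them as textual attributes an LLM can read. The band order, thresholds, and wording of the attributes are illustrative assumptions, not the SPIE dataset's actual encoding.

```python
# Compute classical spectral indices and turn them into textual attributes.
import numpy as np

def spectral_attributes(green, red, nir):
    ndvi = (nir - red) / (nir + red + 1e-6)      # vegetation index
    ndwi = (green - nir) / (green + nir + 1e-6)  # water index
    attrs = []
    if ndvi.mean() > 0.3:
        attrs.append(f"strong vegetation signal (mean NDVI {ndvi.mean():.2f})")
    if ndwi.mean() > 0.2:
        attrs.append(f"likely open water (mean NDWI {ndwi.mean():.2f})")
    return "; ".join(attrs) or "no dominant spectral signature"

h, w = 64, 64
green, red, nir = (np.random.rand(h, w) for _ in range(3))  # stand-in bands
print(spectral_attributes(green, red, nir))
```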

Authors:Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li
Title: QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering
Abstract:
Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query's subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
中文: 提出的QA-Dragon系统通过动态选择最优检索策略并结合文本与图像搜索,增强了检索增强生成方法,显著提升了复杂视觉问答任务中的推理能力和准确性。
English: Retrieval-Augmented Generation (RAG) is enhanced by the proposed QA-Dragon system, which dynamically selects optimal retrieval strategies and combines text and image search to improve reasoning and accuracy in complex Visual Question Answering tasks.

Authors:Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen, Mingtong Dai, Yongsen Zheng, Lingbo Liu, Guanbin Li, Liang Lin
Title: Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Abstract:
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-Aware View Planning (TAVP), a framework designed to overcome these challenges by integrating active view planning with task-specific representation learning. TAVP employs an efficient exploration policy, accelerated by a novel pseudo-environment, to actively acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TAVP generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. Extensive experiments on RLBench tasks show that our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
中文: 提出的任务感知视角规划(TAVP)框架通过主动选择信息丰富的视角并采用混合专家视觉编码器分离任务特定特征,显著提升了机器人操作性能,优于固定视角方法。
English: The proposed Task-Aware View Planning (TAVP) framework enhances robotic manipulation by actively selecting informative viewpoints and employing a Mixture-of-Experts visual encoder to disentangle task-specific features, achieving superior performance over fixed-view methods.
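
A minimal sketch of a Mixture-of-Experts feature layer: a router computes a weighted blend of expert MLPs per token, so different tasks can lean on different experts. The expert count, dense routing, and placement in the encoder are illustrative assumptions, not TAVP's exact design.

```python
# Minimal dense Mixture-of-Experts layer over visual tokens.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim=256, num_experts=4, hidden=512):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))

    def forward(self, x):                                       # x: [B, T, dim]
        gates = self.router(x).softmax(dim=-1)                  # [B, T, E]
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # [B, T, dim, E]
        return (outs * gates.unsqueeze(2)).sum(dim=-1)          # gated expert mixture

x = torch.randn(2, 49, 256)        # e.g. 7x7 grid of visual tokens
print(MoELayer()(x).shape)         # torch.Size([2, 49, 256])
```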

Authors:Qi Xie, Jiahong Fu, Zongben Xu, Deyu Meng
Title: Rotation Equivariant Arbitrary-scale Image Super-Resolution
Abstract:
Arbitrary-scale image super-resolution (ASISR), a recently popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of the INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the inherent nature of the embedded equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug \& play manner to further enhance their performance.
中文摘要:本研究提出了一种旋转等变的任意尺度图像超分辨率方法,通过重构编码器和隐式神经表示模块,实现了从输入到输出的端到端旋转等变性,有效保持几何图案的结构完整性,并在多个数据集上验证了其优越性能。
English Summary: This study introduces a rotation equivariant arbitrary-scale image super-resolution (ASISR) method that redesigns encoder and implicit neural representation modules to preserve geometric pattern integrity, achieving end-to-end rotational equivariance and demonstrating superior performance on various datasets.
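
A sketch of the generic diagnostic behind the notion of equivariance error: numerically compare f(rotate(x)) with rotate(f(x)) for an image-to-image operator f. This is a simple empirical check for 90-degree rotations, not the paper's theoretical error analysis; the operators and input are placeholders.

```python
# Numerically measure rotation equivariance error of an image-to-image operator.
import torch
import torch.nn as nn

def equivariance_error(f, x, k=1):
    """Relative error between f(rot_k(x)) and rot_k(f(x)) for k*90 degrees."""
    a = f(torch.rot90(x, k, dims=(-2, -1)))
    b = torch.rot90(f(x), k, dims=(-2, -1))
    return ((a - b).norm() / (b.norm() + 1e-12)).item()

x = torch.randn(1, 3, 64, 64)
pointwise = nn.ReLU()                    # pointwise ops are exactly equivariant
conv = nn.Conv2d(3, 3, 3, padding=1)     # a generic conv with a random kernel is not
print(equivariance_error(pointwise, x))  # ~0.0
print(equivariance_error(conv, x))       # > 0
```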

Authors:Yifu Guo, Yuquan Lu, Wentao Zhang, Zishan Xu, Dexia Chen, Siyu Zhang, Yizhe Zhang, Ruixuan Wang
Title: Decoupling Continual Semantic Segmentation
Abstract:
Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.
中文: DecoupleCSS提出了一种两阶段框架,通过解耦类别感知检测与类别无关分割来缓解持续语义分割中的灾难性遗忘问题,借助改进的保留-可塑性平衡实现了最先进的性能。
English: DecoupleCSS introduces a two-stage framework that separates class-aware detection from class-agnostic segmentation to mitigate catastrophic forgetting in continual semantic segmentation, achieving state-of-the-art performance through enhanced retention-plasticity balance.
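
A minimal LoRA sketch, since the first stage adapts pre-trained text and image encoders with LoRA: a frozen linear layer is wrapped with a trainable low-rank update. The rank, scaling, and placement are illustrative assumptions, not DecoupleCSS's exact configuration.

```python
# Minimal LoRA: frozen base linear layer plus a trainable low-rank residual.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: identity at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # torch.Size([4, 768]) 12288
```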

Authors:Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee
Title: A Novel Image Similarity Metric for Scene Composition Structure
Abstract:
The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.
中文: 本文提出SCSSIM这一无需训练的新指标,通过立方体分层划分统计量化场景构图结构的保持度,相比传统方法对构图扭曲展现出更优的鲁棒性评估能力。
English: This paper introduces SCSSIM, a novel training-free metric that evaluates image quality by quantifying the preservation of Scene Composition Structure through cuboidal partitioning, demonstrating superior robustness to compositional distortions compared to traditional methods.
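
A simplified illustration of comparing scene composition via hierarchical block statistics: both images are recursively split into quadrants and per-block means are compared. This conveys only the flavor of partition-based structural comparison; it is not the actual SCSSIM definition, which uses Cuboidal partitioning and different statistical measures.

```python
# Toy hierarchical block-statistics comparison of scene composition.
import numpy as np

def block_means(img, depth):
    h, w = img.shape
    means = []
    for level in range(depth + 1):
        n = 2 ** level
        for i in range(n):
            for j in range(n):
                blk = img[i*h//n:(i+1)*h//n, j*w//n:(j+1)*w//n]
                means.append(blk.mean())
    return np.array(means)

def block_similarity(a, b, depth=3):
    fa, fb = block_means(a, depth), block_means(b, depth)
    return 1.0 - np.abs(fa - fb).mean()   # 1 = identical block statistics

a = np.zeros((128, 128)); a[:64, :64] = 1.0                    # object in top-left
noisy = np.clip(a + 0.02 * np.random.randn(128, 128), 0, 1)    # non-compositional distortion
moved = np.zeros((128, 128)); moved[64:, 64:] = 1.0            # composition changed
print(block_similarity(a, noisy), block_similarity(a, moved))  # high vs. clearly lower
```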

Authors:Shushi Wang, Chunyi Li, Zicheng Zhang, Han Zhou, Wei Dong, Jun Chen, Guangtao Zhai, Xiaohong Liu
Title: AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content
Abstract:
AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant bottleneck in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to AU-IQA is https://github.com/WNNGGU/AU-IQA-Dataset.
中文: 基于AI的图像增强技术提升了用户生成内容的质量,但缺乏专门的评估模型,因此构建了AU-IQA数据集来测试现有方法在AI增强图像上的表现。
English: AI-based image enhancement improves UGC quality but lacks specialized assessment models, leading to the creation of the AU-IQA dataset to evaluate existing methods on AI-enhanced images.

Authors:Shenglun Chen, Xinzhu Ma, Hong Zhang, Haojie Li, Zhihui Wang
Title: Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion
Abstract:
Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine intricate structures, we introduce a learnable correction module that progressively adjusts the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution data and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.
Chinese: 本文提出了一种新颖的深度补全框架,利用深度基础模型从RGB图像中提取环境线索,无需大规模训练即可在分布外场景中实现卓越的鲁棒性。
English: This paper introduces a novel depth completion framework that utilizes depth foundation models to extract environmental cues from RGB images, enabling robust performance in out-of-distribution scenarios without large-scale training.
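
A sketch of one common step when combining a monocular depth prior with sparse measurements: fit a global scale and shift so the foundation model's relative depth agrees with the sparse metric points (least squares), then use the aligned map where measurements are missing. The paper's dual-space propagation and learnable correction module go well beyond this; the data here is synthetic.

```python
# Least-squares scale/shift alignment of relative depth to sparse metric depth.
import numpy as np

def align_and_fill(rel_depth, sparse_depth, valid_mask):
    """rel_depth: [H, W] relative depth; sparse_depth is valid where valid_mask."""
    r = rel_depth[valid_mask]
    s = sparse_depth[valid_mask]
    A = np.stack([r, np.ones_like(r)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, s, rcond=None)
    dense = scale * rel_depth + shift
    dense[valid_mask] = sparse_depth[valid_mask]   # keep the exact measurements
    return dense

h, w = 48, 64
gt = np.linspace(2.0, 10.0, h * w).reshape(h, w)    # synthetic metric depth
rel = (gt - gt.min()) / (gt.max() - gt.min())        # pretend foundation-model output
mask = np.random.rand(h, w) < 0.02                   # ~2% sparse samples
dense = align_and_fill(rel, gt * mask, mask)
print(np.abs(dense - gt).mean())                     # ~0 on this toy example
```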

Authors:Zheng Chen, Mingde Zhou, Jinpei Guo, Jiale Yuan, Yifei Ji, Yulun Zhang
Title: Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
Abstract:
Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20$\times$. Code is released at: https://github.com/zhengchen1999/SODEC.
中文: SODEC是一种创新的单步扩散图像压缩模型,通过利用信息丰富的VAE潜在表示和保真度引导模块,解决了传统方法解码延迟高和保真度差的问题,在实现卓越性能的同时将解码速度提升了20倍以上。
English: SODEC is a novel single-step diffusion image compression model that overcomes the decoding latency and fidelity issues of previous methods by using informative VAE latents and a fidelity guidance module, achieving superior performance and over 20x faster decoding speed.

Authors:Zhu Xu, Ting Lei, Zhimin Li, Guan Wang, Qingchao Chen, Yuxin Peng, Yang Liu
Title: TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Abstract:
Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in the dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow between neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) Dual-stream fusion: we introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.
中文: 提出的时序增强关系感知知识迁移(TRKT)方法通过结合关系感知知识挖掘与光流增强及双流融合模块,解决了弱监督动态场景图生成中外部物体检测器的局限性,从而提升了物体定位精度和置信度,在Action Genome数据集上取得了领先性能。
English: The proposed Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method addresses the limitations of external object detectors in weakly supervised dynamic scene graph generation by integrating relation-aware knowledge mining with optical flow enhancement and a dual-stream fusion module to improve object localization and confidence, achieving state-of-the-art results on the Action Genome dataset.

Authors:Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong
Title: Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
Abstract:
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
Chinese: 该方法通过校准令牌将鱼眼图像的潜在嵌入与透视图像对齐,无需重新训练或鱼眼数据即可扩展基础单目深度估计器至鱼眼图像。
English: This method extends foundational monocular depth estimators to fisheye images by aligning their latent embeddings with perspective images using calibration tokens, enabling adaptation without retraining or fisheye data.
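
A sketch of light-weight token-based adaptation in latent space: a small set of learned "calibration" tokens is attended to by the frozen encoder's patch embeddings, producing a residual correction while the backbone stays untouched. The exact way Calibration Tokens modulate FMDE embeddings may differ; the dimensions and attention-based design are assumptions that only illustrate adapting embeddings rather than warping the image.

```python
# Learned tokens modulate frozen patch embeddings via a residual cross-attention.
import torch
import torch.nn as nn

class CalibrationTokens(nn.Module):
    def __init__(self, dim=384, num_tokens=4, heads=6):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))  # zero-init: identity at start
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_emb):               # patch_emb: [B, N, dim] from a frozen encoder
        t = self.tokens.expand(patch_emb.size(0), -1, -1)
        # Patch embeddings query the calibration tokens; the result is a
        # residual correction toward the target (e.g. fisheye) distribution.
        delta, _ = self.attn(patch_emb, t, t)
        return patch_emb + delta

emb = torch.randn(2, 196, 384)                  # e.g. 14x14 ViT patch embeddings
print(CalibrationTokens()(emb).shape)           # torch.Size([2, 196, 384])
```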

Authors:Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao, Zeyu Fu
Title: Revealing Temporal Label Noise in Multimodal Hateful Video Classification
Abstract:
The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrate that timestamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
中文: 本研究通过分析多模态数据集中经裁剪的仇恨内容片段,揭示了粗粒度视频标注会引入标签噪声,证明时间粒度对模型性能有显著影响,并强调需要开发具备时间感知能力的检测方法。
English: This study examines how coarse video-level annotations introduce label noise in hate speech detection by analyzing trimmed hateful segments from multimodal datasets, revealing that temporal granularity significantly impacts model performance and highlighting the need for temporally-aware approaches.

Authors:Noreen Anwar, Guillaume-Alexandre Bilodeau, Wassim Bouachir
Title: Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications
Abstract:
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: \href{https://github.com/DET-LIP/DAMM}{GitHub}.
中文: DAMM框架通过结合多模态查询和双流注意力机制,显著提升了遮挡和复杂场景下的物体检测精度与效率,在多项基准测试中表现卓越。
English: The DAMM framework enhances object detection by integrating multi-modal queries—appearance, positional, and learned—with dual-stream attention, achieving superior accuracy and efficiency on benchmarks.

Authors:Chenhui Qiang, Zhaoyang Wei, Xumeng Han, Zipeng Wang, Siyao Li, Xiangyuan Lan, Jianbin Jiao, Zhenjun Han
Title: VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence
Abstract:
With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., "what is in the image?"), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs' ability to: 1) identify fine-grained visual clues, often occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models' capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.
中文: VER-Bench是一个新颖的评估框架,旨在检验多模态大语言模型识别细粒度视觉线索并融合世界知识进行复杂推理的能力,揭示了现有模型在提取细微证据和构建证据链方面的局限性。
English: VER-Bench is a novel framework designed to evaluate MLLMs' ability to identify fine-grained visual clues and integrate them with world knowledge for complex reasoning, revealing current models' limitations in extracting subtle evidence and constructing evidence-based arguments.

Authors:Seungyong Lee, Jeong-gi Kwak
Title: Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Abstract:
Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
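A hedged illustration of attention temperature scaling, one of the two inference-time techniques mentioned above. Exactly where Voost applies the temperature inside its diffusion transformer is not specified here; this only shows the generic mechanism.

```python
# Scaled dot-product attention whose softmax is softened by a temperature;
# a temperature > 1 flattens the distribution, which can help when the
# inference resolution or mask differs from what was seen in training.
import math
import torch

def attention_with_temperature(q, k, v, temperature=1.0):
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / (math.sqrt(d) * temperature)
    return torch.softmax(logits, dim=-1) @ v

# Shapes are illustrative assumptions: (batch, heads, tokens, head_dim).
q = torch.randn(1, 8, 77, 64)
k = torch.randn(1, 8, 77, 64)
v = torch.randn(1, 8, 77, 64)
out = attention_with_temperature(q, k, v, temperature=1.2)
print(out.shape)  # torch.Size([1, 8, 77, 64])
```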

Authors:Mehrdad Moradi, Marco Grasso, Bianca Maria Colosimo, Kamran Paynabar
Title: Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models
Abstract:
Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) Choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR
Chinese: RADAR采用基于注意力的扩散模型,无需重构即可直接生成异常图,在检测精度和计算效率上均显著优于现有方法。
English: RADAR introduces a reconstruction-free approach using attention-based diffusion models to generate anomaly maps directly, significantly enhancing detection accuracy and computational efficiency over current methods.

Authors:Ziyang Leng, Jiawei Yang, Wenlong Yi, Bolei Zhou
Title: Occupancy Learning with Spatiotemporal Memory
Abstract:
3D occupancy has become a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation, and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%.

Authors:Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang
Title: SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Abstract:
Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
中文:SEAgent框架通过经验学习使计算机使用代理能自主掌握新型软件,结合自我进化机制和专家知识,在成功率上比现有模型提升了23.2%。
English: The proposed SEAgent framework enables computer-use agents to autonomously master novel software through experiential learning, achieving a 23.2% improvement in success rate over existing models by integrating self-evolving mechanisms and specialist knowledge.
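A minimal sketch of the group-relative advantage used by GRPO, which the abstract applies to successful trajectories. The reward values and group size are made up for illustration; this is not the SEAgent training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size) rewards for rollouts of the same task.
    Each rollout's advantage is its reward normalized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],    # task 1: two successful rollouts
                        [0.0, 0.0, 1.0, 0.0]])   # task 2: one successful rollout
print(group_relative_advantages(rewards))
```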

Authors:Liang Xu, Chengqun Yang, Zili Lin, Fei Xu, Yifan Liu, Congsheng Xu, Yiyi Zhang, Jie Qin, Xingdong Sheng, Yunhui Liu, Xin Jin, Yichao Yan, Wenjun Zeng, Xiaokang Yang
Title: Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
Abstract:
Learning action models from real-world human-centric interaction datasets is important for efficiently building general-purpose intelligent assistants. However, most existing datasets only cover specialist interaction categories and ignore the fact that AI assistants perceive and act from a first-person perspective. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions, and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future work on building AI agents in the physical world.
中文: 本文提出了首个大规模多模态数据集InterVLA,通过第一人称视角和指令记录人-物-人交互,旨在推动AI助手学习通用交互模型以应用于现实世界。
English: This paper introduces InterVLA, the first large-scale multimodal dataset capturing human-object-human interactions through egocentric vision and commands, aiming to advance AI assistants' ability to learn generalist interaction models for real-world applications.

Authors:Gustav Hanning, Kalle Åström, Viktor Larsson
Title: PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment
Abstract:
Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For evaluation, we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground-truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend to multi-room estimation, e.g., larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.
中文: PixCuboid提出了一种基于优化的立方体房间布局估计方法,通过多视角深度特征对齐实现,在新基准测试中显著优于现有技术,并能灵活扩展到多房间场景。
English: PixCuboid introduces an optimization-based method for cuboid room layout estimation using multi-view alignment of deep features, outperforming existing approaches on new benchmarks and enabling multi-room extension despite single-cuboid training.

Authors:Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang
Title: X-SAM: From Segment Anything to Any Segmentation
Abstract:
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from \textit{segment anything} to \textit{any segmentation}. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
中文:X-SAM是一个简化的多模态大语言模型,通过引入统一框架和视觉基础分割,将分割能力从“分割一切”扩展到“任意分割”,在多种基准测试中取得了最先进的性能。
English: X-SAM is a streamlined multimodal large language model that extends segmentation capabilities from "segment anything" to "any segmentation" by introducing a unified framework and visual grounded segmentation, achieving state-of-the-art performance across various benchmarks.

Authors:Baihui Xiao, Chengjian Feng, Zhijian Huang, Feng yan, Yujie Zhong, Lin Ma
Title: RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
Abstract:
Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim that improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, via adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by around 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios. Project page: https://stars79689.github.io/RoboTron-Sim/
中文:RoboTron-Sim通过生成模拟困难案例并运用多模态学习弥合虚实差距,在关键场景中将自动驾驶性能提升约50%,有效应对罕见高风险驾驶状况。
English: RoboTron-Sim enhances autonomous driving in critical situations by generating simulated hard cases and using multimodal learning to bridge real-simulation gaps, achieving a 50% performance improvement in challenging scenarios.

Authors:Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu
Title: BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment
Abstract:
Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: \textbf{it reduces zero-shot generalization error by $\!>\!40\%$ on Middlebury and ETH3D}, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, our approach enables robust 3D perception that transcends modality-specific limitations. Codes available at https://github.com/aeolusguan/BridgeDepth.
中文摘要:本文提出了一种通过潜在表征迭代双向对齐来融合单目与立体深度估计的统一框架,利用单目结构先验解决立体视觉歧义,同时通过立体几何优化单目深度,实现了最先进的性能。
English Summary: This paper presents a unified framework that integrates monocular and stereo depth estimation through iterative bidirectional alignment, achieving state-of-the-art performance by resolving stereo ambiguities with monocular priors while refining monocular depth with stereo geometry.
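A hedged sketch of one round of bidirectional latent alignment between monocular context features and stereo hypothesis features, as described above. The token layout, dimensions, and use of nn.MultiheadAttention are assumptions, not the BridgeDepth architecture.

```python
import torch
import torch.nn as nn

class BidirectionalAlignment(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.mono_to_stereo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.stereo_to_mono = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mono_tokens, stereo_tokens):
        # mono_tokens: (B, N, dim) monocular context; stereo_tokens: (B, M, dim) stereo hypotheses.
        # Inject monocular structure priors into the stereo hypotheses...
        stereo_upd, _ = self.mono_to_stereo(stereo_tokens, mono_tokens, mono_tokens)
        # ...and refine the monocular representation with stereo geometry.
        mono_upd, _ = self.stereo_to_mono(mono_tokens, stereo_tokens, stereo_tokens)
        return mono_tokens + mono_upd, stereo_tokens + stereo_upd

align = BidirectionalAlignment()
mono = torch.randn(2, 1024, 128)
stereo = torch.randn(2, 512, 128)
mono2, stereo2 = align(mono, stereo)
print(mono2.shape, stereo2.shape)  # torch.Size([2, 1024, 128]) torch.Size([2, 512, 128])
```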

Authors:Jun Li, Che Liu, Wenjia Bai, Mingxuan Liu, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel
Title: Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding
Abstract:
In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose \textbf{Knowledge to Sight (K2Sight)}, a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5\% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82\% improvement in $mAP_{50}$. Code and models: \href{https://lijunrio.github.io/K2Sight/}{\textcolor{SOTAPink}{https://lijunrio.github.io/K2Sight/}}.
中文: 本文提出K2Sight框架,通过将临床概念分解为视觉属性并采用结构化提示进行高效训练,使紧凑模型仅用少量数据和参数即可实现与大型医学视觉语言模型相媲美的性能。
English: This paper introduces K2Sight, a framework that enhances medical image grounding by decomposing clinical concepts into visual attributes and using structured prompts for efficient training of compact models, achieving competitive performance with minimal data and parameters.
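A toy illustration of turning a clinical finding into an attribute-style instruction prompt (shape, density, location), in the spirit of the decomposition described above. The attribute values and wording are invented examples, not distilled from the ontologies actually used by K2Sight.

```python
def attribute_prompt(finding: str, attributes: dict) -> str:
    # Concise instruction-style prompt built from interpretable visual attributes.
    parts = [f"{k}: {v}" for k, v in attributes.items()]
    return f"Find the region showing {finding}; " + "; ".join(parts) + "."

prompt = attribute_prompt(
    "pulmonary nodule",
    {"shape": "round, well-circumscribed",
     "density": "soft-tissue opacity",
     "location": "right upper lobe"},
)
print(prompt)
# Find the region showing pulmonary nodule; shape: round, well-circumscribed; ...
```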

Authors:Yijie Li, Wei Zhang, Xi Zhu, Ye Wu, Yogesh Rathi, Lauren J. O'Donnell, Fan Zhang
Title: DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling
Abstract:
This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking's strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: https://github.com/yishengpoxiao/DDtracking.git
中文摘要:DDTracking是一种创新的深度生成框架,通过条件去噪扩散过程和双路径编码实现扩散MRI纤维束追踪,在多个数据集上展现出卓越性能与强大泛化能力。
English Summary: DDTracking is a novel deep generative framework that uses a conditional denoising diffusion process and dual-pathway encoding for robust diffusion MRI tractography, demonstrating superior performance and strong generalizability across diverse datasets.

Authors:Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin
Title: One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose
Abstract:
Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios-for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce \textbf{OMFA} (\emph{One Model For All}), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses-even without access to multi-pose images of that person. OMFA is built upon a novel \emph{partial diffusion} strategy that selectively applies noise and denoising to individual components of the joint input-such as the garment, the person image, or the face-enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: https://onemodelforall.github.io/.
中文摘要:OMFA是一种统一的扩散框架,无需展示服装或分割掩码即可实现虚拟试穿与脱卸,通过创新的部分扩散策略支持任意姿势,为真实场景提供实用的服装合成解决方案。
English Summary: OMFA is a unified diffusion framework that enables virtual try-on and try-off without requiring exhibition garments or segmentation masks, supporting arbitrary poses through a novel partial diffusion strategy for realistic garment synthesis.
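A minimal sketch of the partial diffusion idea: noise is added only to the components selected for generation (for example the person image during try-on), while the remaining components stay clean as conditioning. The noise schedule and the packing of components into a dict are simplified assumptions.

```python
import torch

def partial_noising(components: dict, noisy_keys: set, alpha_bar_t: float):
    """components: dict of latents, e.g. {'garment': ..., 'person': ..., 'face': ...}.
    Only the keys in `noisy_keys` receive forward-diffusion noise at level t."""
    out = {}
    for name, x0 in components.items():
        if name in noisy_keys:
            noise = torch.randn_like(x0)
            out[name] = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * noise
        else:
            out[name] = x0  # kept clean, acts as the condition
    return out

latents = {"garment": torch.randn(1, 4, 32, 32),
           "person": torch.randn(1, 4, 64, 48),
           "face": torch.randn(1, 4, 16, 16)}
# Try-on direction: denoise the person latent conditioned on clean garment and face.
noised = partial_noising(latents, noisy_keys={"person"}, alpha_bar_t=0.5)
print({k: bool(torch.equal(v, latents[k])) for k, v in noised.items()})
```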

Authors:Minghang Zheng, Yuxin Peng, Benyuan Sun, Yi Yang, Yang Liu
Title: Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
Abstract:
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable the real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and further regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
中文: 本文提出了一种用于在线视频时序定位的分层事件记忆框架,通过建模事件级信息并保留近期和长期历史数据来实现实时事件定位,在多个数据集上取得了最先进的性能。
English: This paper introduces a hierarchical event memory framework for online video temporal grounding that enables real-time event localization by modeling event-level information and retaining both recent and long-term historical data, achieving state-of-the-art performance across multiple datasets.
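A schematic sketch of a two-level event memory: a short FIFO of recent event proposals plus a capped long-term store that keeps the highest-scoring historical events. The scoring and promotion rules are placeholders for the idea described above, not the released OnVTG code.

```python
from collections import deque

class HierarchicalEventMemory:
    def __init__(self, recent_size=16, long_term_size=64):
        self.recent = deque(maxlen=recent_size)   # most recent event proposals
        self.long_term_size = long_term_size
        self.long_term = []                       # (score, feature) pairs

    def update(self, event_feature, score):
        # An event about to be evicted from the recent memory is promoted to the
        # long-term store, which keeps only the most valuable entries.
        if len(self.recent) == self.recent.maxlen:
            self.long_term.append(self.recent[0])
            self.long_term.sort(key=lambda x: x[0], reverse=True)
            del self.long_term[self.long_term_size:]
        self.recent.append((score, event_feature))

    def read(self):
        # The model would attend over both recent and long-term entries.
        return list(self.recent) + self.long_term

mem = HierarchicalEventMemory()
for t in range(100):
    mem.update(event_feature=f"feat_{t}", score=t % 7)
print(len(mem.read()))  # 16 recent + up to 64 long-term entries
```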

Authors:Safwen Naimi, Arij Said, Wassim Bouachir, Guillaume-Alexandre Bilodeau
Title: InceptoFormer: A Multi-Signal Neural Framework for Parkinson's Disease Severity Evaluation from Gait
Abstract:
We present InceptoFormer, a multi-signal neural framework designed for Parkinson's Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (H&Y) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes, thereby extracting features across multiple temporal scales. The Transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design enables the model to capture fine-grained temporal variations and global dynamics in the gait signal, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at https://github.com/SafwenNaimi/InceptoFormer
中文: InceptoFormer是一个结合Inception1D和Transformer组件的神经网络框架,通过步态分析评估帕金森病严重程度,该框架通过捕捉多尺度时间特征并处理类别不平衡问题,实现了96.6%的准确率。
English: InceptoFormer is a neural framework combining Inception1D and Transformer components to evaluate Parkinson's Disease severity through gait analysis, achieving 96.6% accuracy by capturing multi-scale temporal features and addressing class imbalance.
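An illustrative Inception1D-style block: parallel 1D convolutions with different kernel sizes whose outputs are concatenated, matching the multi-scale idea described above. Channel counts and kernel sizes are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class Inception1D(nn.Module):
    def __init__(self, in_channels=19, branch_channels=32, kernel_sizes=(3, 7, 15)):
        super().__init__()
        # One branch per kernel size; odd kernels with padding k//2 preserve length.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_channels, branch_channels, k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x):              # x: (B, channels, time)
        return torch.cat([b(x) for b in self.branches], dim=1)

block = Inception1D()
gait = torch.randn(4, 19, 500)         # e.g. 19 gait sensor channels over 500 time steps (assumed)
print(block(gait).shape)               # torch.Size([4, 96, 500])
```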

Authors:Johannes Tischer, Patric Kienast, Marlene Stümpflen, Gregor Kasprian, Georg Langs, Roxane Licandro
Title: Conditional Fetal Brain Atlas Learning for Automatic Tissue Segmentation
Abstract:
Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet, its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these challenges, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator and is trained on a curated dataset of 219 neurotypical fetal MRIs spanning 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and delivers robust segmentation performance with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at https://github.com/cirmuw/fetal-brain-atlas
中文: 本研究提出了一种新型深度学习框架,用于生成连续、年龄特定的胎儿大脑图谱,能够以高精度和最少预处理实现实时组织分割及个体化发育评估。
English: This study presents a novel deep-learning framework for creating continuous, age-specific fetal brain atlases that enable real-time tissue segmentation and individualized developmental assessment with high accuracy and minimal preprocessing.
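For reference, the Dice Similarity Coefficient reported above can be computed per label as below; this is a generic implementation over integer label maps, not code from the paper, and the toy volumes are random.

```python
import torch

def dice_per_label(pred: torch.Tensor, target: torch.Tensor, label: int, eps=1e-6) -> float:
    p = (pred == label)
    t = (target == label)
    inter = (p & t).sum().item()
    return (2 * inter + eps) / (p.sum().item() + t.sum().item() + eps)

pred = torch.randint(0, 7, (64, 64, 64))    # 6 tissue classes + background, toy volumes
target = torch.randint(0, 7, (64, 64, 64))
scores = [dice_per_label(pred, target, c) for c in range(1, 7)]
print(sum(scores) / len(scores))            # average DSC over the six tissues
```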

Authors:Uzay Gökay, Federico Spurio, Dominik R. Bach, Juergen Gall
Title: Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation
Abstract:
Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at https://github.com/bachlab/SMQ .
中文: 本文提出了一种基于骨骼的无监督时序动作分割方法,通过时序自编码器和运动词汇量化技术,在三个基准数据集上超越了现有最优方法。
English: This paper introduces an unsupervised skeleton-based temporal action segmentation method using a temporal autoencoder and motion word quantization, which outperforms existing approaches on three benchmark datasets.
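A sketch of turning latent skeleton sequences into "motion words": the latent sequence is cut into non-overlapping temporal patches, flattened, and quantized with k-means so that each patch is assigned a discrete word. The patch length, codebook size, and use of scikit-learn are assumptions; the paper's autoencoder that produces the latents is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def motion_words(latents: np.ndarray, patch_len=8, n_words=64, seed=0):
    """latents: (num_sequences, T, dim) latent skeleton features."""
    n, t, d = latents.shape
    t = (t // patch_len) * patch_len                         # drop the incomplete tail
    patches = latents[:, :t].reshape(n, -1, patch_len * d)   # (n, num_patches, patch_len*dim)
    flat = patches.reshape(-1, patch_len * d)
    km = KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(flat)
    return km.labels_.reshape(n, -1), km                     # discrete word index per patch

latents = np.random.randn(10, 100, 16)
words, km = motion_words(latents)
print(words.shape)   # (10, 12)
```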

Authors:Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo, Yulun Zhang
Title: QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution
Abstract:
Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: https://github.com/bowenchai/QuantVSR.
中文摘要:QuantVSR提出了一种用于视频超分辨率的低比特量化模型,通过时空复杂度感知机制和可学习偏置对齐模块,在保持与全精度模型相当性能的同时,显著优于现有量化方法。
English Summary: QuantVSR introduces a low-bit quantization model for video super-resolution that employs spatio-temporal complexity awareness and learnable bias alignment to achieve performance comparable to full-precision models while significantly outperforming existing quantization methods.
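A toy illustration of allocating layer-specific ranks for the full-precision auxiliary branch from per-layer spatio-temporal complexity statistics, in the spirit of the STCA mechanism above. The normalization scheme, rank range, and example statistics are invented for the sketch.

```python
def allocate_ranks(complexities, min_rank=4, max_rank=64):
    """complexities: {layer_name: complexity score measured on the calibration set}."""
    lo, hi = min(complexities.values()), max(complexities.values())
    span = max(hi - lo, 1e-8)
    ranks = {}
    for name, c in complexities.items():
        frac = (c - lo) / span                      # 0 for the simplest layer, 1 for the hardest
        ranks[name] = int(round(min_rank + frac * (max_rank - min_rank)))
    return ranks

stats = {"block1.conv": 0.12, "block2.attn": 0.55, "block3.conv": 0.91}
print(allocate_ranks(stats))   # {'block1.conv': 4, 'block2.attn': 37, 'block3.conv': 64}
```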

Authors:Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su
Title: Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Abstract:
Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, \underline{C}ausality-driven \underline{V}isual object \underline{C}ompletion (CVC). This task requires LVLMs to infer the masked object in an image based on its \textit{causal} relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (\textit{e.g.}, GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4\% and 4.0\% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.
中文: 大型视觉语言模型通过基于因果关系的视觉对象补全任务开发了一种自我改进框架,利用自动化实例构建和试错学习增强其视觉感知与推理能力,在专业及综合基准测试中实现了显著性能提升。
English: This work develops a self-improvement framework for Large Vision-Language Models built on a visual knowledge-intensive task, Causality-driven Visual object Completion (CVC), which enhances visual perception and reasoning through automated instance construction and trial-and-error learning, yielding significant gains on specialized and comprehensive benchmarks.

Authors:Xuan Loc Pham, Gwendolyn Vuurberg, Marjan Doppen, Joey Roosen, Tip Stille, Thi Quynh Ha, Thuy Duong Quach, Quoc Vu Dang, Manh Ha Luu, Ewoud J. Smit, Hong Son Mai, Mattias Heinrich, Bram van Ginneken, Mathias Prokop, Alessa Hering
Title: TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration
Abstract:
Image registration is a fundamental technique in the analysis of longitudinal and multi-phase CT images within clinical practice. However, most existing methods are tailored for single-organ applications, limiting their generalizability to other anatomical regions. This work presents TotalRegistrator, an image registration framework capable of aligning multiple anatomical regions simultaneously using a standard UNet architecture and a novel field decomposition strategy. The model is lightweight, requiring only 11GB of GPU memory for training. To train and evaluate our method, we constructed a large-scale longitudinal dataset comprising 695 whole-body (thorax-abdomen-pelvic) paired CT scans from individual patients acquired at different time points. We benchmarked TotalRegistrator against a generic classical iterative algorithm and a recent foundation model for image registration. To further assess robustness and generalizability, we evaluated our model on three external datasets: the public thoracic and abdominal datasets from the Learn2Reg challenge, and a private multiphase abdominal dataset from a collaborating hospital. Experimental results on the in-house dataset show that the proposed approach generally surpasses baseline methods in multi-organ abdominal registration, with a slight drop in lung alignment performance. On out-of-distribution datasets, it achieved competitive results compared to leading single-organ models, despite not being fine-tuned for those tasks, demonstrating strong generalizability. The source code will be publicly available at: https://github.com/DIAGNijmegen/oncology_image_registration.git.
中文: TotalRegistrator是一种基于UNet架构和场分解策略的轻量级图像配准框架,能够同时配准多个解剖区域,在不同数据集上展现出强大的泛化能力,尽管在特定器官配准中存在微小性能波动。
English: TotalRegistrator is a lightweight image registration framework using a UNet architecture and field decomposition to align multiple anatomical regions simultaneously, demonstrating strong generalizability across diverse datasets despite minor performance variations in specific organs.

Authors:Ethan Dack, Lorenzo Brigato, Vasilis Dedousis, Janine Gote-Schniering, Cheryl, Hanno Hoppe, Aristomenis Exadaktylos, Manuela Funke-Chambour, Thomas Geiser, Andreas Christe, Lukas Ebner, Stavroula Mougiakakou
Title: Unmasking Interstitial Lung Diseases: Leveraging Masked Autoencoders for Diagnosis
Abstract:
Masked autoencoders (MAEs) have emerged as a powerful approach for pre-training on unlabelled data, capable of learning robust and informative feature representations. This is particularly advantageous in diffused lung disease research, where annotated imaging datasets are scarce. To leverage this, we train an MAE on a curated collection of over 5,000 chest computed tomography (CT) scans, combining in-house data with publicly available scans from related conditions that exhibit similar radiological patterns, such as COVID-19 and bacterial pneumonia. The pretrained MAE is then fine-tuned on a downstream classification task for diffused lung disease diagnosis. Our findings demonstrate that MAEs can effectively extract clinically meaningful features and improve diagnostic performance, even in the absence of large-scale labelled datasets. The code and the models are available here: https://github.com/eedack01/lung_masked_autoencoder.
中文: 掩码自编码器(MAE)能够从无标注的胸部CT扫描中学习稳健特征,在少量标注数据上微调后显著提升弥漫性肺病的诊断性能。
English: Masked autoencoders (MAEs) effectively learn robust features from unlabeled chest CT scans, enhancing diagnostic performance for diffused lung diseases when fine-tuned on limited annotated data.
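A hedged sketch of the downstream step: a classification head is attached to a pretrained MAE encoder and fine-tuned on labelled CT data. The encoder below is a stand-in module so the snippet runs; the real pretrained model, input pipeline, and number of disease classes are not specified by this example.

```python
import torch
import torch.nn as nn

class CTClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder              # pretrained MAE encoder (weights loaded elsewhere)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.encoder(x)             # assumed to return (B, feat_dim) pooled features
        return self.head(feats)

# Stand-in encoder; replace with the pretrained MAE encoder in practice.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
model = CTClassifier(encoder, feat_dim=768, num_classes=4)
logits = model(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
loss.backward()
print(logits.shape, float(loss))
```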

Authors:Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang, Hisham Cholakkal, Rao Muhammad Anwer
Title: Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
Abstract:
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R\textsuperscript{2}-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R\textsuperscript{2}-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.
中文: 提出的TGS-Agent通过将任务分解为思考-定位-分割流程来解决指代音频-视觉分割问题,先通过多模态推理识别目标对象,再进行无需像素级监督的定位与分割,在基准测试中取得了领先性能。
English: The proposed TGS-Agent addresses Referring Audio-Visual Segmentation by decomposing it into a Think-Ground-Segment process, which first identifies the referred object through multimodal reasoning and then performs grounding and segmentation without pixel-level supervision, achieving state-of-the-art results on benchmarks.

Authors:Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang
Title: Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
Abstract:
The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at https://zhang9302002.github.io/thinkingwithvideos-page/.

Authors:Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, Hao Sun
Title: TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO
English: The proposed Temporal Sampling Policy Optimization (TSPO) method enhances multimodal large language models' long-form video understanding through reinforcement learning, achieving state-of-the-art performance across multiple benchmarks.
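A minimal sketch of probabilistic keyframe selection: an event-aware agent is assumed to output a relevance score per frame, and keyframes are sampled from the resulting categorical distribution without replacement. The scoring network and the reinforcement-learning update are omitted; only the sampling step is shown.

```python
import torch

def sample_keyframes(frame_scores: torch.Tensor, k: int, temperature: float = 1.0):
    """frame_scores: (num_frames,) unnormalized relevance logits from the temporal agent."""
    probs = torch.softmax(frame_scores / temperature, dim=0)
    idx = torch.multinomial(probs, num_samples=k, replacement=False)
    return torch.sort(idx).values, probs     # keep temporal order before feeding the MLLM

scores = torch.randn(1000)                   # e.g. one score per frame of a long video
frames, probs = sample_keyframes(scores, k=16)
print(frames)
```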

Authors:Junan Lin, Daizong Liu, Xianke Chen, Xiaoye Qu, Xun Yang, Jixiang Zhu, Sanyuan Zhang, Jianfeng Dong
Title: Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval
Abstract:
Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical: not all audio is helpful for video moment retrieval, and the audio in some videos may be pure noise or background sound that is meaningless for moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at the local, event, and global levels, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing the audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art performance among audio-video fusion VMR methods. Our code is available at https://github.com/HuiGuanLab/IMG.
中文: 本文提出了一种重要性感知多粒度融合模型(IMG),通过动态整合音频、视觉和文本模态来处理视频片段检索任务,采用自适应加权和多层次融合策略应对噪声音频干扰,并实现了最先进的性能。
English: This paper introduces the Importance-aware Multi-Granularity fusion model (IMG) that dynamically integrates audio, visual, and textual modalities for Video Moment Retrieval, addressing noisy audio interference through adaptive weighting and multi-level fusion while achieving state-of-the-art performance.
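A sketch of importance-aware fusion: a small predictor scores how useful the audio stream is, and the audio contribution is down-weighted accordingly before being fused with the visual features. The architecture, pooling choice, and feature shapes are illustrative assumptions, not the IMG model.

```python
import torch
import torch.nn as nn

class ImportanceAwareFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.importance = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                        nn.Linear(dim // 2, 1), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual, audio):
        # visual, audio: (B, T, dim) query-conditioned clip features.
        w = self.importance(audio.mean(dim=1, keepdim=True))   # (B, 1, 1) predicted audio usefulness
        fused = torch.cat([visual, w * audio], dim=-1)          # noisy audio gets suppressed
        return self.proj(fused), w.squeeze(-1).squeeze(-1)

fusion = ImportanceAwareFusion()
v = torch.randn(2, 64, 256)
a = torch.randn(2, 64, 256)
out, weight = fusion(v, a)
print(out.shape, weight.shape)   # torch.Size([2, 64, 256]) torch.Size([2])
```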

Authors:Xiao Wang, Ziwen Wang, Wentao Wu, Anjie Wang, Jiashu Wu, Yantao Pan, Chenglong Li
Title: Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark
Abstract:
With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code will be released at https://github.com/Event-AHU/SAV
中文: 本文提出SAV框架,通过结合基于SAM的编码器-解码器、知识图谱和上下文检索模块来改进车辆部件分割,并发布了VehicleSeg10K数据集以推动该领域研究。
English: This paper introduces SAV, a novel framework that enhances vehicle part segmentation by integrating a SAM-based encoder-decoder with a knowledge graph and context retrieval module, and releases the VehicleSeg10K dataset to advance research in this field.

Authors:Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
Title: LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation
Abstract:
Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .
中文: LayerT2V首次提出分层生成方法,通过将独立元素置于不同图层进行视频合成,有效解决了多物体运动轨迹控制难题,并在性能指标上大幅超越现有技术。
English: LayerT2V introduces a layered generation approach for Text-to-Video synthesis that enables coherent multi-object motion control by compositing independent elements on separate layers, achieving significant performance improvements over existing methods.

Authors:Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, Yonghong Tian
Title: Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
Abstract:
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) \textit{Multi-Modal Replay Strategies} address cross-modal drift through explicit or implicit memory mechanisms; (2) \textit{Cross-Modal Regularization} preserves modality alignment during updates; and (3) \textit{Parameter-Efficient Adaptation} mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.
中文: 本综述系统梳理了视觉语言模型持续学习面临的挑战,提出了针对跨模态漂移、模态对齐保持和参数干扰的解决方案分类法,并指出需要建立更好的评估基准来推动终身视觉语言系统的发展。
English: This survey systematically reviews continual learning challenges in vision-language models, identifying core failure modes and proposing a taxonomy of solutions to address cross-modal drift, alignment preservation, and parameter interference while highlighting the need for better evaluation benchmarks.

Authors:Jianxun Yu, Ruiquan Ge, Zhipeng Wang, Cheng Yang, Chenyu Lin, Xianjun Fu, Jikui Liu, Ahmed Elazab, Changmiao Wang
Title: Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification
Abstract:
The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods. The code is available at https://github.com/yjx1234/MMCAF-Net
中文: MMCAF-Net模型通过多尺度交叉注意力融合机制有效解决医学影像与电子健康记录间的维度差异问题,在Lung-PET-CT-Dx数据集上显著提升了小病灶诊断准确率,性能优于现有先进方法。
English: The proposed MMCAF-Net model effectively addresses dimensional inconsistencies between medical imaging and electronic health records through multi-scale cross-attention fusion, significantly improving diagnostic accuracy for small lesions as demonstrated on the Lung-PET-CT-Dx dataset.

Authors:Wengang Guo, Wei Ye, Chunchun Chen, Xin Sun, Christian Böhm, Claudia Plant, Susanto Rahardja
Title: Bootstrap Deep Spectral Clustering with Optimal Transport
Abstract:
Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering -- affinity matrix construction, spectral embedding, and $k$-means clustering -- using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16\% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
Chinese: BootSC是一种深度谱聚类模型,通过端到端网络整合所有步骤,利用最优传输监督和正交重参数化技术,显著提升了聚类性能,如在ImageNet-Dogs数据集上相比次优方法NMI指标提高了16%。
English: BootSC is a deep spectral clustering model that integrates all stages into a single end-to-end network, using optimal transport supervision and orthogonal embeddings to achieve state-of-the-art performance, such as a 16% NMI improvement on ImageNet-Dogs.
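For readers unfamiliar with optimal-transport-derived supervision, the sketch below shows a generic Sinkhorn-Knopp normalization that turns a sample-to-cluster similarity matrix into balanced soft assignments; it only illustrates the kind of target such supervision bootstraps, with the temperature and iteration count chosen arbitrarily rather than taken from BootSC.

import numpy as np

def sinkhorn_assignments(scores, epsilon=0.05, n_iters=50):
    """Balanced soft assignments from a (n_samples, n_clusters) similarity matrix."""
    q = np.exp(scores / epsilon)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)   # each cluster receives total mass 1/k
        q /= k
        q /= q.sum(axis=1, keepdims=True)   # each sample distributes mass 1/n
        q /= n
    return q * n                            # rows become soft one-hot assignments

rng = np.random.default_rng(0)
soft = sinkhorn_assignments(rng.normal(size=(6, 3)))
print(soft.round(2), soft.sum(axis=1).round(2))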

Authors:Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li, Yu Zhou, Can Ma, Xiaojun Bi
Title: Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective
Abstract:
Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework, which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86\% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.
中文摘要:GAT模型通过聚合文本实例上下文并追踪其时空轨迹,显著提升了视频文本问答的准确性和推理速度,超越了现有最优方法。
English Summary: The GAT model improves Video TextVQA by gathering contextual text instances and tracing their spatio-temporal trajectories, achieving superior accuracy and faster inference than existing methods.

Authors:Fengyi Wu, Yimian Dai, Tianfang Zhang, Yixuan Ding, Jian Yang, Ming-Ming Cheng, Zhenming Peng
Title: RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation
Abstract:
Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
中文摘要:RPCANet++ 创新地将鲁棒主成分分析与深度网络架构相结合,通过模块化设计实现了高效的稀疏目标分割,在保持理论可解释性的同时显著提升了计算性能。
English Summary: RPCANet++ is a novel deep learning framework that integrates robust PCA principles with efficient network architectures to achieve state-of-the-art sparse object segmentation while enhancing computational efficiency and interpretability.
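The iteration that deep-unfolding methods of this kind unroll is the classical RPCA proximal update: singular-value thresholding for the low-rank background and element-wise soft thresholding for the sparse objects. The sketch below shows that plain iteration with hand-set thresholds; RPCANet++ replaces these fixed operators with learned modules (BAM, OEM, IRM), so this is background for the unfolding idea, not the authors' architecture.

import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_step(D, L, S, tau_l=1.0, tau_s=0.1):
    # Low-rank update: shrink singular values of the background estimate.
    U, sig, Vt = np.linalg.svd(D - S, full_matrices=False)
    L_new = (U * soft_threshold(sig, tau_l)) @ Vt
    # Sparse update: shrink the residual element-wise.
    S_new = soft_threshold(D - L_new, tau_s)
    return L_new, S_new

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 32))               # toy observation matrix
L, S = np.zeros_like(D), np.zeros_like(D)
for _ in range(10):
    L, S = rpca_step(D, L, S)
print(np.linalg.matrix_rank(np.round(L, 6)), np.mean(S != 0))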

Authors:Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang
Title: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
Abstract:
Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual features with the semantic-related features output by the large language model (LLM) of the MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.
中文: 提出的MLLMSeg框架充分利用多模态大模型固有的视觉特征,无需额外视觉编码器即可实现精确的参照表达分割,在性能和成本效率上均优于现有方法。
English: The proposed MLLMSeg framework effectively utilizes the inherent visual features of multimodal large models to achieve precise reference expression segmentation without additional visual encoders, outperforming existing methods in both performance and cost efficiency.

Authors:Zunhui Xia, Hongxing Li, Libin Lan
Title: TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation
Abstract:
In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models' ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model's ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model's feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.
中文摘要:TCSAFormer是一种高效的医学图像分割网络,通过压缩注意力模块和双分支前馈网络解决了传统Transformer计算复杂度高和局部特征捕获能力有限的问题,在多个数据集上以更低计算成本实现了优越性能。
English Summary: TCSAFormer is an efficient medical image segmentation network that addresses the computational complexity and limited local feature capture of traditional transformers by incorporating a Compressed Attention module and Dual-Branch Feed-Forward Network, achieving superior performance with lower computational overhead on multiple datasets.
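One simple way to realize the "attend only to the most relevant key-value pairs per query" idea behind the Compressed Attention module is top-k sparse attention, sketched below. The real CA module additionally prunes globally irrelevant tokens and merges redundant ones; the value of k and the tensor shapes here are illustrative.

import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=16):
    # q, k, v: (batch, num_tokens, dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (B, N, N)
    top_vals, top_idx = scores.topk(top_k, dim=-1)
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, top_idx, top_vals)                      # keep only top-k scores per query
    attn = F.softmax(mask, dim=-1)                            # pruned entries get zero weight
    return attn @ v

q = k = v = torch.randn(2, 64, 32)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([2, 64, 32])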

Authors:Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng
Title: VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning
Abstract:
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, and thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.
中文: VisualTrans是首个专为现实世界人机交互中的视觉转换推理设计的综合基准,通过多样化任务和结构化评估弥补现有基准的不足,同时揭示了当前模型在动态和因果推理能力上的显著缺陷。
English: VisualTrans is introduced as the first comprehensive benchmark for visual transformation reasoning in real-world human-object interactions, addressing limitations of existing benchmarks through diverse tasks and structured evaluations, while revealing significant gaps in current models' dynamic and causal reasoning abilities.

Authors:Tongshun Zhang, Pingling Liu, Zijian Zhang, Qiuzhan Zhou
Title: SPJFNet: Self-Mining Prior-Guided Joint Frequency Enhancement for Ultra-Efficient Dark Image Restoration
Abstract:
Current dark image restoration methods suffer from severe efficiency bottlenecks, primarily stemming from: (1) computational burden and error correction costs associated with reliance on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) indiscriminate processing across frequency components in frequency-domain methods, leading to excessive global computational demands. To address these challenges, we propose an Efficient Self-Mining Prior-Guided Joint Frequency Enhancement Network (SPJFNet). Specifically, we first introduce a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from the network, eliminating dependence on external priors and thereby bypassing error correction overhead while improving inference speed. Second, through meticulous analysis of different frequency domain characteristics, we reconstruct and compress multi-level operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based advantageous frequency enhancement, significantly reducing parameters. Building upon this foundation, we propose a Dual-Frequency Guidance Framework (DFGF) that strategically deploys specialized high/low frequency branches (wavelet-domain high-frequency enhancement and Fourier-domain low-frequency restoration), decoupling frequency processing to substantially reduce computational complexity. Rigorous evaluation across multiple benchmarks demonstrates that SPJFNet not only surpasses state-of-the-art performance but also achieves significant efficiency improvements, substantially reducing model complexity and computational overhead. Code is available at https://github.com/bywlzts/SPJFNet.
中文摘要:现有暗图像修复方法因外部先验依赖、冗余处理流程及频率分量无差别计算导致效率低下,SPJFNet通过自挖掘先验引导、联合频率增强与双频解耦处理,在提升性能的同时显著降低了计算复杂度。
English Summary: Current dark image restoration methods face efficiency issues from external priors, redundant pipelines, and indiscriminate frequency processing, which SPJFNet addresses through self-mined priors, joint frequency enhancement, and dual-frequency guidance to achieve superior performance with reduced complexity.
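As a minimal illustration of the kind of decomposition a dual-frequency design routes to separate branches, the sketch below splits an image into low- and high-frequency parts with a Fourier-domain mask. SPJFNet's actual modules combine lossless wavelet decomposition with Fourier-based enhancement inside learned layers; the cutoff radius below is an arbitrary assumption.

import numpy as np

def frequency_split(img, radius=8):
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real   # low-frequency branch input
    high = img - low                                          # high-frequency branch input
    return low, high

img = np.random.default_rng(0).random((64, 64))
low, high = frequency_split(img)
print(np.allclose(low + high, img))  # True: the split is lossless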

Authors:Zechen Li, Baiyu Chen, Hao Xue, Flora D. Salim
Title: ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Abstract:
Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.
中文: ZARA是一种基于智能体的创新框架,可直接从原始运动传感器数据实现零样本、可解释的人类活动识别,无需重新训练或特定任务分类器即可达到最优性能。
English: ZARA is a novel agent-based framework that enables zero-shot, explainable human activity recognition from raw motion sensor data, achieving state-of-the-art performance without requiring retraining or task-specific classifiers.
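A toy sketch of the data-side ingredients described above, under loose assumptions: a pair-wise knowledge base that stores, for every activity pair, the features whose class means differ most, and a retrieval step that surfaces those entries as textual evidence an LLM agent could cite. The feature names, the statistic, and the evidence format are all illustrative, not ZARA's exact design.

import numpy as np
from itertools import combinations

def build_pairwise_kb(features, labels, names, top_k=2):
    kb = {}
    for a, b in combinations(sorted(set(labels)), 2):
        fa = features[labels == a].mean(axis=0)
        fb = features[labels == b].mean(axis=0)
        gap = np.abs(fa - fb)                     # how discriminative each feature is
        idx = gap.argsort()[::-1][:top_k]
        kb[(a, b)] = [f"{names[i]}: {fa[i]:.2f} vs {fb[i]:.2f}" for i in idx]
    return kb

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.repeat(np.array(["walk", "run", "sit"]), 20)
X[y == "run", 0] += 3.0                           # make one feature discriminative
kb = build_pairwise_kb(X, y, ["accel_mean", "accel_std", "gyro_mean", "gyro_std"])
print(kb[("run", "walk")])                        # evidence lines an agent could cite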

Authors:Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Syahid Al Irfan, Hindriyanto Dwi Purnomo, Radius Tanone
Title: CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion
Abstract:
This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. In the fine-tuning, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID-V2.
中文:CORE-ReID V2通过CycleGAN合成数据和先进的融合机制增强无监督领域自适应在行人、车辆及物体重识别中的应用,提升了特征表示和伪标签准确性,并以轻量级骨干网络实现顶尖性能。
English: CORE-ReID V2 enhances unsupervised domain adaptation for Person, Vehicle, and Object ReID by using CycleGAN for data synthesis and an advanced fusion mechanism to improve feature representation and pseudo-label accuracy, achieving state-of-the-art performance with efficient lightweight backbones.
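The Efficient Channel Attention Block presumably follows the standard ECA recipe: global average pooling, a cheap 1-D convolution across the channel descriptor, and sigmoid gating. The sketch below implements that generic recipe; the kernel size and shapes are illustrative, and the paper's ECAB/SECAB may differ in detail.

import torch
import torch.nn as nn

class ECABlock(nn.Module):
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # (B, C) global average pool
        w = self.conv(w.unsqueeze(1))          # (B, 1, C) conv across channels
        w = torch.sigmoid(w).squeeze(1)        # (B, C) channel weights
        return x * w[:, :, None, None]         # re-weight feature maps

feat = torch.randn(2, 64, 16, 8)
print(ECABlock()(feat).shape)  # torch.Size([2, 64, 16, 8])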

Authors:Hee-Yeun Kim, Byeonggyu Park, Byonghyok Choi, Hansang Cho, Byungkwan Kim, Soomok Lee, Mingu Jeon, Seung-Woo Seo, Seong-Woo Kim
Title: Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation
Abstract:
The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera image with 2D radar point cloud (PCD) data. The proposed method initially detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at https://hiyeun.github.io/NLoS/.

Authors:Junyi Wang, Jinjiang Li, Guodong Fan, Yakun Ju, Xiang Fang, Alex C. Kot
Title: Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation
Abstract:
In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra-class variance and high inter-class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class-guided approaches are limited by coarse class prototype representations and a neglect of target structural information. Therefore, this paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet). The design of this network is based on a core concept: a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we design three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy from the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the Semantic-Structure Coordination Module (SSCM) follows a hierarchical semantics-first, structure-second principle: it first establishes global semantic cognition and then leverages structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step-size adjustment mechanism to focus on discriminative features between classes. Extensive experiments demonstrate that PDSSNet outperforms state-of-the-art methods. The source code is available at https://github.com/wangjunyi-1/PDSSNet.
中文摘要:本文提出的原型驱动结构协同网络(PDSSNet)通过自适应原型提取、语义结构协调和通道相似性调节三个核心模块,将不变类别语义与多变空间结构相结合,有效解决了遥感图像中因类内差异大和类间相似性高导致的物体分割不完整问题,性能超越现有先进方法。
English Summary: The proposed Prototype-Driven Structure Synergy Network (PDSSNet) addresses incomplete segmentation in remote sensing imagery by integrating invariant class semantics with variant spatial structures through three specialized modules, achieving superior performance over existing methods.
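At its simplest, extracting an unbiased class prototype from the ground truth amounts to masked average pooling of the feature map over each class's pixels, as sketched below. APEM wraps this idea in learned encoding, so the snippet only illustrates the underlying operation; shapes and class count are illustrative.

import torch

def extract_prototypes(features, gt, num_classes):
    # features: (B, C, H, W); gt: (B, H, W) integer class map
    b, c, h, w = features.shape
    flat_feat = features.view(b, c, h * w)
    flat_gt = gt.view(b, 1, h * w)
    protos = []
    for cls in range(num_classes):
        mask = (flat_gt == cls).float()                        # (B, 1, HW)
        denom = mask.sum(dim=-1).clamp(min=1.0)                # avoid divide-by-zero for absent classes
        protos.append((flat_feat * mask).sum(dim=-1) / denom)  # (B, C) masked average
    return torch.stack(protos, dim=1)                          # (B, num_classes, C)

feats = torch.randn(2, 32, 16, 16)
gt = torch.randint(0, 4, (2, 16, 16))
print(extract_prototypes(feats, gt, num_classes=4).shape)  # torch.Size([2, 4, 32])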

Authors:Haiqi Yang, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Title: Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability
Abstract:
Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, whether LMMs can actively detect and scrutinize erroneous inputs remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs yields several key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust also varies: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the two conflict. These insights underscore the urgent need to enhance LMMs' proactive verification of input validity and offer new insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.
Chinese Summary: 该研究提出ISEval框架,发现多数大型多模态模型难以主动识别缺陷输入,存在对显式提示的过度依赖,且在错误类型和模态信任方面表现不一。
English Summary: The study introduces ISEval, a framework revealing that most large multimodal models struggle to detect flawed inputs independently, showing over-reliance on explicit prompts and varying performance across error types and modality trust.

Authors:Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
Title: S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation
Abstract:
Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S$^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce \textit{Hessian-aware Salient Data Selection}, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose \textit{Attention-guided Sparse Token Distillation}, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
中文: 本文提出S²Q-VDiT后训练量化框架,通过基于Hessian的显著数据选择和注意力引导的稀疏令牌蒸馏,解决了视频扩散模型中的高校准方差和学习难题,在实现显著压缩和加速的同时保持了近乎无损的性能。
English: This paper introduces S²Q-VDiT, a post-training quantization framework that addresses high calibration variance and learning challenges in video diffusion models by using Hessian-aware salient data selection and attention-guided sparse token distillation, achieving near-lossless performance with significant compression and acceleration.
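For readers unfamiliar with the W4A6 notation, the sketch below applies symmetric uniform fake quantization with 4-bit weights and 6-bit activations using simple max calibration. A PTQ framework like the one above additionally selects calibration data and distillation targets carefully, so this only shows what the bit-widths mean numerically; shapes and the calibration rule are illustrative.

import numpy as np

def fake_quantize(x, num_bits):
    qmax = 2 ** (num_bits - 1) - 1          # symmetric signed integer range
    scale = np.abs(x).max() / qmax          # naive max calibration
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
a = rng.normal(size=(8, 256))
w_q = fake_quantize(w, num_bits=4)          # "W4"
a_q = fake_quantize(a, num_bits=6)          # "A6"
err = np.abs(a @ w.T - a_q @ w_q.T).mean()
print(f"mean abs error of the quantized matmul: {err:.4f}")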

Authors:Jinwei Zhang, Lianrui Zuo, Blake E. Dewey, Samuel W. Remedios, Yihao Liu, Savannah P. Hays, Dzung L. Pham, Ellen M. Mowry, Scott D. Newsome, Peter A. Calabresi, Aaron Carass, Jerry L. Prince
Title: UNISELF: A Unified Network with Instance Normalization and Self-Ensembled Lesion Fusion for Multiple Sclerosis Lesion Segmentation
Abstract:
Automated segmentation of multiple sclerosis (MS) lesions using multicontrast magnetic resonance (MR) images improves efficiency and reproducibility compared to manual delineation, with deep learning (DL) methods achieving state-of-the-art performance. However, these DL-based methods have yet to simultaneously optimize in-domain accuracy and out-of-domain generalization when trained on a single source with limited data, or their performance has been unsatisfactory. To fill this gap, we propose a method called UNISELF, which achieves high accuracy within a single training domain while demonstrating strong generalizability across multiple out-of-domain test datasets. UNISELF employs a novel test-time self-ensembled lesion fusion to improve segmentation accuracy, and leverages test-time instance normalization (TTIN) of latent features to address domain shifts and missing input contrasts. Trained on the ISBI 2015 longitudinal MS segmentation challenge training dataset, UNISELF ranks among the best-performing methods on the challenge test dataset. Additionally, UNISELF outperforms all benchmark methods trained on the same ISBI training data across diverse out-of-domain test datasets with domain shifts and missing contrasts, including the public MICCAI 2016 and UMCL datasets, as well as a private multisite dataset. These test datasets exhibit domain shifts and/or missing contrasts caused by variations in acquisition protocols, scanner types, and imaging artifacts arising from imperfect acquisition. Our code is available at https://github.com/uponacceptance.
中文: 提出的UNISELF方法通过测试时自集成病灶融合和实例归一化技术,在多发性硬化病灶分割中实现了高域内精度和强跨域泛化能力,有效应对域偏移和缺失对比度问题。
English: The proposed UNISELF method achieves high in-domain accuracy and strong out-of-domain generalization for MS lesion segmentation by employing test-time self-ensembled lesion fusion and instance normalization to handle domain shifts and missing contrasts.
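Test-time instance normalization in its plainest form re-standardizes a test instance's latent features with that instance's own per-channel statistics, pulling domain-shifted inputs back toward the feature distribution seen in training. The sketch below shows that operation on a single feature map; the shapes and epsilon are illustrative, and UNISELF applies the idea inside its network rather than as a standalone step.

import numpy as np

def ttin(feat, eps=1e-5):
    # feat: (C, H, W) latent features of a single test instance
    mean = feat.mean(axis=(1, 2), keepdims=True)
    var = feat.var(axis=(1, 2), keepdims=True)
    return (feat - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
shifted = rng.normal(loc=3.0, scale=2.5, size=(16, 32, 32))  # simulated domain-shifted features
normed = ttin(shifted)
print(normed.mean(axis=(1, 2)).round(3)[:4], normed.std(axis=(1, 2)).round(3)[:4])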

Authors:Yajun Liu, Zenghui Zhang, Jiang Yue, Weiwei Guo, Dongying Li
Title: M$^3$HL: Mutual Mask Mix with High-Low Level Feature Consistency for Semi-Supervised Medical Image Segmentation
Abstract:
Data augmentation methods inspired by CutMix have demonstrated significant potential in recent semi-supervised medical image segmentation tasks. However, these approaches often apply CutMix operations in a rigid and inflexible manner, while paying insufficient attention to feature-level consistency constraints. In this paper, we propose a novel method called Mutual Mask Mix with High-Low level feature consistency (M$^3$HL) to address the aforementioned challenges, which consists of two key components: 1) M$^3$: an enhanced data augmentation operation inspired by the masking strategy from Masked Image Modeling (MIM), which advances conventional CutMix through dynamically adjustable masks to generate spatially complementary image pairs for collaborative training, thereby enabling effective information fusion between labeled and unlabeled images. 2) HL: a hierarchical consistency regularization framework that enforces high-level and low-level feature consistency between unlabeled and mixed images, enabling the model to better capture discriminative feature representations. Our method achieves state-of-the-art performance on widely adopted medical image segmentation benchmarks including the ACDC and LA datasets. Source code is available at https://github.com/PHPJava666/M3HL.
中文: 本文提出M$^3$HL方法,通过动态掩码数据增强和分层特征一致性约束,在半监督医学图像分割任务中实现了最先进的性能,并在ACDC和LA数据集上得到验证。
English: This paper introduces M$^3$HL, an enhanced semi-supervised medical image segmentation method that combines dynamic mask-based data augmentation with hierarchical feature consistency to improve performance on benchmarks like ACDC and LA.
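The core augmentation can be pictured as pasting spatially complementary regions of a labeled and an unlabeled image into each other with a binary mask and its complement, as sketched below. M$^3$ adjusts the mask dynamically during training, whereas this toy version fixes the mask ratio and block size.

import numpy as np

def mutual_mask_mix(img_l, img_u, mask_ratio=0.5, block=8, rng=None):
    rng = rng or np.random.default_rng()
    h, w = img_l.shape
    coarse = rng.random((h // block, w // block)) < mask_ratio
    m = np.kron(coarse, np.ones((block, block)))           # block-wise binary mask
    mixed_a = m * img_l + (1 - m) * img_u                  # labeled patches pasted into the unlabeled image
    mixed_b = (1 - m) * img_l + m * img_u                  # the complementary mix
    return mixed_a, mixed_b, m

rng = np.random.default_rng(0)
x_l, x_u = rng.random((64, 64)), rng.random((64, 64))
a, b, m = mutual_mask_mix(x_l, x_u, rng=rng)
print(a.shape, b.shape, m.mean().round(2))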

Authors:Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
Title: Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
Abstract:
Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at https://github.com/alibaba-damo-academy/ViSD-Boost.
中文摘要:本文提出通过疾病级别对比学习和解剖结构正态建模来增强视觉语义密度,从而改进医学视觉语言预训练,在多个CT数据集的零样本诊断任务中实现了最优性能。
English Summary: This paper proposes a method to enhance vision-language pre-training for medical diagnostics by boosting visual semantic density through disease-level contrastive learning and anatomical normality modeling, achieving state-of-the-art zero-shot performance across multiple CT datasets.

Authors:Kushal Kanwar, Dushyant Singh Chauhan, Gopendra Vikram Singh, Asif Ekbal
Title: What is Beneath Misogyny: Misogynous Memes Classification and Explanation
Abstract:
Memes are popular in the modern world and are distributed primarily for entertainment. However, harmful ideologies such as misogyny can be propagated through innocent-looking memes. The detection and understanding of why a meme is misogynous is a research challenge due to its multimodal nature (image and text) and its nuanced manifestations across different societal contexts. We introduce a novel multimodal approach, \textit{namely}, \textit{\textbf{MM-Misogyny}} to detect, categorize, and explain misogynistic content in memes. \textit{\textbf{MM-Misogyny}} processes text and image modalities separately and unifies them into a multimodal context through a cross-attention mechanism. The resulting multimodal context is then easily processed for labeling, categorization, and explanation via a classifier and Large Language Model (LLM). The evaluation of the proposed model is performed on a newly curated dataset (\textit{\textbf{W}hat's \textbf{B}eneath \textbf{M}isogynous \textbf{S}tereotyping (WBMS)}) created by collecting misogynous memes from cyberspace and categorizing them into four categories, \textit{namely}, Kitchen, Leadership, Working, and Shopping. The model not only detects and classifies misogyny, but also provides a granular understanding of how misogyny operates in domains of life. The results demonstrate the superiority of our approach compared to existing methods. The code and dataset are available at https://github.com/kushalkanwarNS/WhatisBeneathMisogyny/tree/main.
中文: 摘要介绍了MM-Misogyny这一多模态方法,它通过整合文本和图像数据来检测、分类并解释表情包中的厌女内容,并在WBMS数据集上展现出卓越性能。
English: The abstract introduces MM-Misogyny, a multimodal method that detects, classifies, and explains misogynistic content in memes by integrating text and image data, demonstrating superior performance on the WBMS dataset.

Authors:Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
Title: LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Abstract:
Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
中文: LongVie是一种自回归框架,通过统一噪声初始化、全局控制归一化和多模态引导,解决了超长视频生成中的时序不一致和视觉退化问题。
English: LongVie is an autoregressive framework that addresses temporal inconsistency and visual degradation in ultra-long video generation through unified noise initialization, global control normalization, and multi-modal guidance.
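A rough way to picture unified noise initialization: instead of drawing an independent latent noise tensor for every clip, all clips share one base tensor (here blended with a small per-clip component), which keeps their diffusion starting points correlated. The blend weight and tensor shapes below are illustrative assumptions, not LongVie's exact scheme.

import torch

def init_noise(num_clips, shape, unified=True, blend=0.9, seed=0):
    g = torch.Generator().manual_seed(seed)
    base = torch.randn(shape, generator=g)            # shared base noise
    clips = []
    for _ in range(num_clips):
        fresh = torch.randn(shape, generator=g)        # independent per-clip noise
        clips.append(blend * base + (1 - blend) * fresh if unified else fresh)
    return torch.stack(clips)

ind = init_noise(4, (8, 16, 16), unified=False)
uni = init_noise(4, (8, 16, 16), unified=True)
corr = lambda n: torch.corrcoef(n.flatten(1))[0, 1].item()
print(f"clip-to-clip correlation: independent={corr(ind):.2f}, unified={corr(uni):.2f}")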

Authors:Katherine Liu, Sergey Zakharov, Dian Chen, Takuya Ikeda, Greg Shakhnarovich, Adrien Gaidon, Rares Ambrus
Title: OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World
Abstract:
We would like to estimate the pose and full shape of an object from a single observation, without assuming known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distribution. OmniShape demonstrates compelling performance on challenging real world datasets. Project website: https://tri-ml.github.io/omnishape

Authors:Xinyu Wang, Yue Zhang, Liqiang Jing
Title: Can Large Vision-Language Models Understand Multimodal Sarcasm?
Abstract:
Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model's ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at https://github.com/cp-cp/LVLM-MSA.
中文摘要:本文评估了大型视觉语言模型在多模态讽刺分析中的应用,发现其在视觉理解和概念知识方面存在不足,并提出了一种无需训练的框架,通过结合深度对象提取和外部知识来提升模型对讽刺的解读能力。
English Summary: This paper evaluates Large Visual Language Models in multimodal sarcasm analysis, identifying limitations in visual understanding and conceptual knowledge, and proposes a training-free framework that enhances sarcasm interpretation by integrating object extraction and external knowledge.

Authors:Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park
Title: Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
Abstract:
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
中文: Uni3R提出了一种前馈框架,能够从无位姿多视角图像中重建带有开放词汇语义的3D场景,在新视角合成和语义分割任务上实现了最先进的性能。
English: Uni3R introduces a feed-forward framework that reconstructs semantically enriched 3D scenes from unposed multi-view images, achieving state-of-the-art performance in novel view synthesis and semantic segmentation.

Authors:Wuyang Li, Wentao Pan, Xiaoyuan Liu, Zhendong Luo, Chenxin Li, Hengyu Liu, Din Ping Tsai, Mu Ku Chen, Yixuan Yuan
Title: MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy
Abstract:
Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, where physical constraints at millimetre-scale thickness impose serious impediments on micro-level clinical use. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap and advance novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel neural network tailored for metalens endoscopy and driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from the foundational model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalization in real biomedical scenes.
中文: 基于超构透镜的微型内窥镜为超微成像提供了创新方案,而提出的MetaScope框架通过物理光学驱动的神经网络有效解决了光学问题,显著提升了图像质量和临床适用性。
English: Miniaturized endoscopy using metalenses offers a promising ultra-micro imaging solution, and the proposed MetaScope framework effectively addresses optical challenges through physics-informed neural networks to enhance image quality and clinical applicability.

Authors:Xinyu Xiong, Zihuang Wu, Lei Zhang, Lei Lu, Ming Li, Guanbin Li
Title: SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks
Abstract:
Recent studies have highlighted the potential of adapting the Segment Anything Model (SAM) for various downstream tasks. However, constructing a more powerful and generalizable encoder to further enhance performance remains an open challenge. In this work, we propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet while extending the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder. By incorporating a dual-resolution strategy and a dense glue layer, our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs. Extensive experiments conducted on four benchmarks, including dichotomous image segmentation, camouflaged object detection, marine animal segmentation, and remote sensing saliency detection, demonstrate the superior performance of our proposed method. The code is available at https://github.com/WZH0120/SAM2-UNeXT.
中文: 本文提出SAM2-UNeXT框架,通过整合DINOv2编码器和双分辨率策略,在简化架构的同时显著提升了多类基准任务的图像分割精度。
English: This paper introduces SAM2-UNeXT, an enhanced framework that integrates a DINOv2 encoder with a dual-resolution strategy to improve segmentation accuracy across multiple benchmarks using a simplified architecture.

Authors:Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen, Yutao Yue
Title: CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation
Abstract:
Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. There have also emerged methods specifically designed for EICG, but they excessively rely on word-level attribute labels for guidance, which suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen's superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code are available at https://github.com/yuankaishen2001/CoEmoGen.
中文: CoEmoGen是一种新颖的流程,利用多模态大语言模型生成情感导向的描述,并通过分层LoRA模块建模情感特征,在定量、定性和用户评估中均展现出卓越的语义连贯性和情感忠实度。
English: CoEmoGen is a novel pipeline that leverages multimodal large language models for emotion-focused captions and a hierarchical LoRA module to generate semantically coherent and emotionally faithful images, demonstrating superior performance across quantitative, qualitative, and user evaluations.

Authors:Yazhou Zhu, Haofeng Zhang
Title: MAUP: Training-free Multi-center Adaptive Uncertainty-aware Prompting for Cross-domain Few-shot Medical Image Segmentation
Abstract:
Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS) is a potential solution for segmenting medical images with limited annotation using knowledge from other domains. The strong performance of current CD-FSMIS models relies on heavy training over other source medical domains, which degrades the universality and ease of model deployment. With the development of large vision models for natural images, we propose a training-free CD-FSMIS model that introduces the Multi-center Adaptive Uncertainty-aware Prompting (MAUP) strategy for adapting the Segment Anything Model (SAM) foundation model, which is trained on natural images, to the CD-FSMIS task. Specifically, MAUP consists of three key innovations: (1) K-means-clustering-based multi-center prompt generation for comprehensive spatial coverage, (2) uncertainty-aware prompt selection that focuses on challenging regions, and (3) adaptive prompt optimization that dynamically adjusts to the target region complexity. With the pre-trained DINOv2 feature encoder, MAUP achieves precise segmentation results across three medical datasets without any additional training, compared with several conventional CD-FSMIS models and a training-free FSMIS model. The source code is available at: https://github.com/YazhouZhu19/MAUP.
中文: 本研究提出无需训练的MAUP策略,通过多中心提示生成、不确定性区域选择和动态提示优化,将Segment Anything模型适配于跨领域少样本医学图像分割任务,在无需额外训练的情况下实现精确分割效果。
English: The study introduces MAUP, a training-free strategy that adapts the Segment Anything Model for cross-domain few-shot medical image segmentation by generating multi-center prompts, selecting uncertain regions, and dynamically optimizing prompts to achieve precise results without additional training.
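A rough sketch of the first two ingredients: k-means over candidate foreground pixel coordinates yields multiple spatially spread point prompts, and an uncertainty score (distance of the local probability from 0.5) can then prioritize prompts in challenging regions. The probability map, the value of k, and the uncertainty rule below are illustrative stand-ins for MAUP's modules.

import numpy as np

def kmeans_centers(points, k, iters=20, rng=None):
    rng = rng or np.random.default_rng(0)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)                      # nearest center for each point
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return centers

rng = np.random.default_rng(0)
prob = rng.random((64, 64))                            # stand-in coarse foreground probability map
ys, xs = np.where(prob > 0.7)                          # candidate foreground pixels
coords = np.stack([ys, xs], axis=1).astype(float)
prompts = kmeans_centers(coords, k=5, rng=rng)         # multi-center point prompts
uncertainty = np.abs(prob[prompts[:, 0].astype(int), prompts[:, 1].astype(int)] - 0.5)
order = uncertainty.argsort()                          # most ambiguous prompts first
print(prompts[order].round(1))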

Authors:Deqiang Yin, Junyi Guo, Huanda Lu, Fangyu Wu, Dongming Lu
Title: EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation
Abstract:
Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is https://yindq99.github.io/EditGarment-project/.

Authors:Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Jianqing Gao, Qingfeng Liu
Title: READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
Abstract:
The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference process of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.

Authors:Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang
Title: CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection
Abstract:
Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Additionally, when combined with our spatially-aware alignment mechanism, CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets, as demonstrated by extensive experiments. Code will be available at https://github.com/cqylunlun/CoPS.
中文: 提出的条件提示合成(CoPS)框架通过基于视觉特征动态生成提示,解决了零样本异常检测中的泛化问题,在13个工业和医疗数据集上实现了最先进的性能。
English: The proposed Conditional Prompt Synthesis (CoPS) framework dynamically generates prompts based on visual features to overcome limitations in zero-shot anomaly detection, achieving state-of-the-art performance across 13 industrial and medical datasets.

Authors:Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang
Title: MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis
Abstract:
Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at https://github.com/HiLab-git/MedCAL-Bench.
English Summary: MedCAL-Bench introduces the first foundation model-based benchmark for cold-start active learning in medical imaging, evaluating 14 models across 7 datasets to reveal DINO's superiority in segmentation tasks and strategy-dependent performance variations.

Authors:Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang, Dengdi Sun
Title: R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation
Abstract:
X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground-truth medical reports using GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features, which interact with the knowledge using cross-attention. The vision tokens are fed into a Q-Former, which retrieves the disease-aware vision tokens using another cross-attention. Finally, we adopt a large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets fully validate the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released at https://github.com/Event-AHU/Medical_Image_Analysis.
English: This paper introduces a large-scale multi-modal medical knowledge graph (M3KG) and a novel framework that integrates it with X-ray images and disease-aware vision tokens using cross-attention mechanisms, significantly enhancing the accuracy and reducing hallucinations in AI-generated medical reports, as validated by extensive experiments.

Authors:Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
Title: Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Abstract:
Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing ability on general and mathematical tasks. In particular, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
English: The proposed MACT framework employs a multi-agent collaboration system with test-time scaling to enhance visual document understanding and VQA, achieving top performance across multiple benchmarks with fewer parameters by integrating specialized agents for planning, execution, judgment, and answering.

Authors:Pingchuan Ma, Xiaopei Yang, Yusong Li, Ming Gui, Felix Krause, Johannes Schusterbauer, Björn Ommer
Title: SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models
Abstract:
Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles × 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process.
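As a reference point for the flow-matching idea, here is a compact training sketch that learns a velocity field between two arbitrary (non-Gaussian) distributions, which is the property the abstract highlights; the MLP, data, and dimensions are placeholders rather than SCFlow's architecture.

    import torch
    import torch.nn as nn

    class Velocity(nn.Module):
        # Small MLP predicting the velocity field v(x_t, t).
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                     nn.Linear(256, dim))

        def forward(self, x, t):
            return self.net(torch.cat([x, t], dim=-1))

    dim = 64
    model = Velocity(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(200):
        # x0 stands in for "disentangled" samples, x1 for "entangled" ones; neither
        # needs to be Gaussian, unlike diffusion or normalizing-flow priors.
        x0 = torch.rand(128, dim)
        x1 = torch.randn(128, dim) * 0.5 + 2.0
        t = torch.rand(128, 1)
        xt = (1 - t) * x0 + t * x1           # point on the straight path
        target_v = x1 - x0                   # constant velocity of that path
        loss = ((model(xt, t) - target_v) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(loss.item())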

Authors:Yifei Sun, Zhanghao Chen, Hao Zheng, Yuqing Lu, Lixin Duan, Fenglei Fan, Ahmed Elazab, Xiang Wan, Changmiao Wang, Ruiquan Ge
Title: GL-LCM: Global-Local Latent Consistency Models for Fast High-Resolution Bone Suppression in Chest X-Ray Images
Abstract:
Chest X-Ray (CXR) imaging for pulmonary diagnosis raises significant challenges, primarily because bone structures can obscure critical details necessary for accurate diagnosis. Recent advances in deep learning, particularly with diffusion models, offer significant promise for effectively minimizing the visibility of bone structures in CXR images, thereby improving clarity and diagnostic accuracy. Nevertheless, existing diffusion-based methods for bone suppression in CXR imaging struggle to balance the complete suppression of bones with preserving local texture details. Additionally, their high computational demand and extended processing time hinder their practical use in clinical settings. To address these limitations, we introduce a Global-Local Latent Consistency Model (GL-LCM) architecture. This model combines lung segmentation, dual-path sampling, and global-local fusion, enabling fast high-resolution bone suppression in CXR images. To tackle potential boundary artifacts and detail blurring in local-path sampling, we further propose Local-Enhanced Guidance, which addresses these issues without additional training. Comprehensive experiments on a self-collected dataset SZCH-X-Rays, and the public dataset JSRT, reveal that our GL-LCM delivers superior bone suppression and remarkable computational efficiency, significantly outperforming several competitive methods. Our code is available at https://github.com/diaoquesang/GL-LCM.
English: The proposed Global-Local Latent Consistency Model (GL-LCM) effectively suppresses bone structures in chest X-ray images while preserving texture details, achieving superior performance and computational efficiency compared to existing methods.

Authors:Haoran Lin, Wenrui Chen, Xianchi Chen, Fan Yang, Qiang Diao, Wenxin Xie, Sijie Wu, Kailun Yang, Maojun Li, Yaonan Wang
Title: UniFucGrasp: Human-Hand-Inspired Unified Functional Grasp Annotation Strategy and Dataset for Diverse Dexterous Hands
Abstract:
Dexterous grasp datasets are vital for embodied intelligence, but mostly emphasize grasp stability, ignoring functional grasps needed for tasks like opening bottle caps or holding cup handles. Most rely on bulky, costly, and hard-to-control high-DOF Shadow Hands. Inspired by the human hand's underactuated mechanism, we establish UniFucGrasp, a universal functional grasp annotation strategy and dataset for multiple dexterous hand types. Based on biomimicry, it maps natural human motions to diverse hand structures and uses geometry-based force closure to ensure functional, stable, human-like grasps. This method supports low-cost, efficient collection of diverse, high-quality functional grasps. Finally, we establish the first multi-hand functional grasp dataset and provide a synthesis model to validate its effectiveness. Experiments on the UFG dataset, IsaacSim, and complex robotic tasks show that our method improves functional manipulation accuracy and grasp stability, enables efficient generalization across diverse robotic hands, and overcomes annotation cost and generalization challenges in dexterous grasping. The project page is at https://haochen611.github.io/UFG.

Authors:Tongshun Zhang, Pingping Liu, Zixuan Zhong, Zijian Zhang, Qiuzhan Zhou
Title: Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration
Abstract:
Recovering fine-grained details in extremely dark images remains challenging due to severe structural information loss and noise corruption. Existing enhancement methods often fail to preserve intricate details and sharp edges, limiting their effectiveness in downstream applications like text and edge detection. To address these deficiencies, we propose an efficient dual-stage approach centered on detail recovery for dark images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections, providing robust priors for high-fidelity frequency processing while mitigating error accumulation risks from unreliable priors. The second stage employs complementary Mamba modules specifically designed for textural structure refinement: (1) Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. (2) Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. Extensive experiments on multiple benchmark datasets and downstream applications demonstrate that our method significantly improves detail recovery performance while maintaining efficiency. Crucially, the proposed modules are lightweight and can be seamlessly integrated into existing Fourier-based frameworks with minimal computational overhead. Code is available at https://github.com/bywlzts/RFGM.
English Summary: This paper introduces a dual-stage approach for enhancing extremely dark images, utilizing a Residual Fourier-Guided Module for global illumination restoration and complementary Mamba modules for textural refinement, achieving superior detail recovery with minimal computational overhead.
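To make the frequency-domain step concrete, below is a minimal residual Fourier block that restores global structure by reprocessing the amplitude spectrum while keeping the phase; the layer sizes and residual wiring are assumptions in the spirit of the Residual Fourier-Guided Module, not its published design.

    import torch
    import torch.nn as nn

    class FourierResidualBlock(nn.Module):
        # Re-weights the amplitude spectrum with a 1x1 conv stack, keeps the phase,
        # and adds the result back to the input as a residual.
        def __init__(self, channels):
            super().__init__()
            self.amp_conv = nn.Sequential(
                nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 1))

        def forward(self, x):
            freq = torch.fft.rfft2(x, norm="ortho")
            amp, phase = freq.abs(), freq.angle()
            amp = self.amp_conv(amp)
            out = torch.fft.irfft2(torch.polar(amp, phase), s=x.shape[-2:], norm="ortho")
            return x + out

    x = torch.rand(1, 16, 64, 64)                # dark-image feature map stand-in
    print(FourierResidualBlock(16)(x).shape)     # torch.Size([1, 16, 64, 64])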

Authors:Jun Luo, Zijing Zhao, Yang Liu
Title: Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation
Abstract:
Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain's style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA
English: SDGPA introduces a novel zero-shot domain adaptation method for semantic segmentation by generating synthetic target-style images using text-to-image diffusion and employing progressive adaptation to bridge domain gaps without requiring target domain images.
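The crop-edit-merge trick can be sketched as follows: overlapping patches are edited one by one and blended back by averaging, which keeps the source layout more faithful than a single global edit; edit_patch is a hypothetical placeholder for the off-the-shelf text-to-image editor.

    import numpy as np

    def edit_patch(patch, style_prompt):
        # Placeholder for a text-guided diffusion edit of a single patch.
        return patch

    def crop_edit_merge(img, style_prompt, patch=128, stride=96):
        # Edit overlapping crops individually, then average overlapping regions.
        h, w, _ = img.shape
        out = np.zeros_like(img, dtype=np.float64)
        weight = np.zeros((h, w, 1))
        ys = list(range(0, max(h - patch, 0) + 1, stride)) or [0]
        xs = list(range(0, max(w - patch, 0) + 1, stride)) or [0]
        for y in ys:
            for x in xs:
                edited = edit_patch(img[y:y + patch, x:x + patch], style_prompt)
                out[y:y + patch, x:x + patch] += edited
                weight[y:y + patch, x:x + patch] += 1.0
        return (out / np.maximum(weight, 1e-8)).astype(img.dtype)

    img = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
    print(crop_edit_merge(img, "night-time rainy street").shape)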

Authors:Hang Guo, Qing Zhang, Zixuan Gao, Siyuan Yang, Shulin Peng, Xiang Tao, Ting Yu, Yan Wang, Qingli Li
Title: Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification
Abstract:
Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments conducted on both a self-constructed patient-level placental dataset and two public datasets demonstrate that our method achieves state-of-the-art diagnostic performance. The code is available at https://github.com/ECNU-MultiDimLab/EmmPD.
English: The study introduces EmmPD, an efficient multimodal framework that overcomes computational challenges in whole slide image analysis through optimized patch selection and hybrid fusion techniques, achieving state-of-the-art placental disease diagnosis performance.

Authors:Gang Dai, Yifan Zhang, Yutao Qin, Qiangya Guo, Shuangping Huang, Shuicheng Yan
Title: Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
Abstract:
Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Code is available at https://github.com/dailenson/DiffBrush.
English Summary: DiffBrush is a diffusion-based model that addresses the challenges of handwritten text-line generation by decoupling style from content and using multi-scale learning to ensure both style imitation and content accuracy.

Authors:Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu
Title: V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models
Abstract:
With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher's outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and curating high-quality pair datasets, along with a step-by-step online approach for calibrated training. We validate our method on two leading T2V models, VideoCrafter2 and AnimateDiff, achieving parameter reductions of 36.2% and 67.5%, respectively, while maintaining or even surpassing the performance of the full models. Further experiments demonstrate the effectiveness of both ReDPO and the V.I.P. framework in enabling efficient and high-quality video generation. Our code and videos are available at https://jiiiisoo.github.io/VIP.github.io/.
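Since the approach combines DPO with SFT, a generic form of the two terms is sketched below; the sequence log-probabilities, β, and the mixing weight λ are placeholders, and the exact combination used by ReDPO is not reproduced here.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w_student, logp_l_student, logp_w_ref, logp_l_ref, beta=0.1):
        # Standard DPO objective: prefer the "winning" sample over the "losing" one
        # relative to a frozen reference model; inputs are sequence log-probs, shape (B,).
        ratio_w = logp_w_student - logp_w_ref
        ratio_l = logp_l_student - logp_l_ref
        return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

    def combined_objective(dpo_term, sft_term, lam=0.5):
        # Illustrative mixture of the preference term and a supervised term.
        return dpo_term + lam * sft_term

    b = 4
    loss = combined_objective(
        dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b)),
        sft_term=torch.tensor(1.3))
    print(loss.item())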

Authors:Xingchao Yang, Shiori Ueda, Yuantian Huang, Tomoya Akiyama, Takafumi Taketomi
Title: FFHQ-Makeup: Paired Synthetic Makeup Dataset with Facial Consistency Across Multiple Styles
Abstract:
Paired bare-makeup facial images are essential for a wide range of beauty-related tasks, such as virtual try-on, facial privacy protection, and facial aesthetics analysis. However, collecting high-quality paired makeup datasets remains a significant challenge. Real-world data acquisition is constrained by the difficulty of collecting large-scale paired images, while existing synthetic approaches often suffer from limited realism or inconsistencies between bare and makeup images. Current synthetic methods typically fall into two categories: warping-based transformations, which often distort facial geometry and compromise the precision of makeup; and text-to-image generation, which tends to alter facial identity and expression, undermining consistency. In this work, we present FFHQ-Makeup, a high-quality synthetic makeup dataset that pairs each identity with multiple makeup styles while preserving facial consistency in both identity and expression. Built upon the diverse FFHQ dataset, our pipeline transfers real-world makeup styles from existing datasets onto 18K identities by introducing an improved makeup transfer method that disentangles identity and makeup. Each identity is paired with 5 different makeup styles, resulting in a total of 90K high-quality bare-makeup image pairs. To the best of our knowledge, this is the first work that focuses specifically on constructing a makeup dataset. We hope that FFHQ-Makeup fills the gap of lacking high-quality bare-makeup paired datasets and serves as a valuable resource for future research in beauty-related tasks.

Authors:Ting Lei, Shaofeng Yin, Qingchao Chen, Yuxin Peng, Yang Liu
Title: Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration
Abstract:
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model's ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model's attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at https://github.com/ltttpku/INP-CC.
English: The proposed INP-CC model advances open-vocabulary human-object interaction detection by introducing interaction-aware prompts that dynamically adapt to visual scenes and concept calibration that refines semantic representations, achieving state-of-the-art performance on benchmark datasets.

Authors:Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
Title: AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Abstract:
Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: https://github.com/I2-Multimedia-Lab/AlignCAT.
English: AlignCAT is a novel query-based semantic matching framework that enhances weakly supervised visual grounding through coarse-grained and fine-grained alignment modules, effectively addressing category and attribute ambiguities and demonstrating superior performance on multiple benchmarks.

Authors:Liangyang Ouyang, Jiafeng Mao
Title: LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing
Abstract:
Text-driven image editing enables users to flexibly modify visual content through natural language instructions, and is widely applied to tasks such as semantic object replacement, insertion, and removal. While recent inversion-based editing methods using rectified flow models have achieved promising results in image quality, we identify a structural limitation in their editing behavior: the semantic bias toward the source concept encoded in the inverted noise tends to suppress attention to the target concept. This issue becomes particularly critical when the source and target semantics are dissimilar, where the attention mechanism inherently leads to editing failure or unintended modifications in non-target regions. In this paper, we systematically analyze and validate this structural flaw, and introduce LORE, a training-free and efficient image editing method. LORE directly optimizes the inverted noise, addressing the core limitations in generalization and controllability of existing approaches, enabling stable, controllable, and general-purpose concept replacement, without requiring architectural modification or model fine-tuning. We conduct comprehensive evaluations on three challenging benchmarks: PIEBench, SmartEdit, and GapEdit. Experimental results show that LORE significantly outperforms strong baselines in terms of semantic alignment, image quality, and background fidelity, demonstrating the effectiveness and scalability of latent-space optimization for general-purpose image editing. Our implementation is available at https://github.com/oyly16/LORE.
English: This paper introduces LORE, a training-free image editing method that optimizes inverted noise to overcome structural limitations in existing models, enabling stable and controllable concept replacement while achieving superior performance in semantic alignment, image quality, and background fidelity across multiple benchmarks.
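A schematic of the latent-space optimization the abstract describes: the inverted noise itself is refined by gradient descent while the generator stays frozen. The edit_objective below is a deliberately simplified stand-in (the paper's actual attention-based objective is not reproduced), and all shapes are illustrative.

    import torch

    def edit_objective(latent, target_embedding):
        # Placeholder editing loss; a real pipeline would score how well a denoised
        # preview of `latent` matches the target concept (e.g. attention or CLIP score).
        return ((latent.mean(dim=(1, 2, 3)) - target_embedding.mean()) ** 2).mean()

    def optimize_inverted_noise(z_inv, target_embedding, steps=50, lr=0.05):
        # Refine the inverted noise directly, without touching model weights.
        z = z_inv.clone().requires_grad_(True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            loss = edit_objective(z, target_embedding)
            opt.zero_grad(); loss.backward(); opt.step()
        return z.detach()

    z_inv = torch.randn(1, 4, 64, 64)     # inverted latent from a rectified-flow model
    target = torch.randn(77, 768)         # text embedding of the target concept
    print(optimize_inverted_noise(z_inv, target).shape)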

Authors:Haozhou Zhai, Yanzhe Gao, Tianjiang Hu
Title: Unit: Building Unit Detection Dataset
Abstract:
Fire scene datasets are crucial for training robust computer vision models, particularly in tasks such as fire early warning and emergency rescue operations. However, among the currently available fire-related data, there is a significant shortage of annotated data specifically targeting building units. To tackle this issue, we introduce an annotated dataset of building units captured by drones, which incorporates multiple enhancement techniques. We construct backgrounds using real multi-story scenes, combine motion blur and brightness adjustment to enhance the authenticity of the captured images, simulate drone shooting conditions under various circumstances, and employ large models to generate fire effects at different locations. The synthetic dataset generated by this method encompasses a wide range of building scenarios, with a total of 1,978 images. This dataset can effectively improve the generalization ability of fire unit detection, providing multi-scenario and scalable data while reducing the risks and costs associated with collecting real fire data. The dataset is available at https://github.com/boilermakerr/FireUnitData.
English Summary: This paper introduces a drone-captured annotated dataset of building units enhanced with realistic simulation techniques to address the shortage of fire scene data, improving detection model generalization while reducing collection risks and costs.

Authors:Sai Ma, Zhuang Li, John A Taylor
Title: Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery
Abstract:
Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both components are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.
English: Vision language models can democratize Earth observation, but existing datasets overlook critical long-term, multi-satellite archives, which the new Landsat30-AU dataset addresses by providing 36 years of Australian satellite imagery with captions and verified VQA samples, revealing current models' limitations while demonstrating significant improvements through fine-tuning.

Authors:Heng Jia, Linchao Zhu, Na Zhao
Title: H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
Abstract:
Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2× faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
English: H3R is a hybrid 3D reconstruction framework that combines volumetric latent fusion with attention-based feature aggregation to achieve robust cross-dataset generalization while converging twice as fast as existing methods, with significant performance improvements across multiple benchmarks.
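Plücker coordinates, mentioned above as the camera-aware ray encoding, are a standard 6-D ray parameterization; the small sketch below shows one common way to compute them for per-pixel rays (the rest of the H3R pipeline is not reproduced, and the toy camera is an assumption).

    import torch

    def plucker_rays(origins, directions):
        # Encode rays as (direction, moment) with moment = origin x direction;
        # the encoding is invariant to which point on the ray is chosen.
        d = directions / directions.norm(dim=-1, keepdim=True)
        m = torch.cross(origins, d, dim=-1)
        return torch.cat([d, m], dim=-1)           # (..., 6)

    # One ray per pixel of a tiny 4x4 camera located at a fixed center.
    H = W = 4
    origins = torch.tensor([0.5, -0.2, -2.0]).repeat(H * W, 1)
    uv = torch.stack(torch.meshgrid(torch.linspace(-1, 1, H),
                                    torch.linspace(-1, 1, W),
                                    indexing="ij"), dim=-1).reshape(-1, 2)
    dirs = torch.cat([uv, torch.ones(H * W, 1)], dim=-1)   # rays through the z = 1 plane
    print(plucker_rays(origins, dirs).shape)                # torch.Size([16, 6])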

Authors:Tianjiao Jiang, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
Title: Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
Abstract:
Few-shot learning (FSL) often requires effective adaptation of models using limited labeled data. However, most existing FSL methods rely on entangled representations, requiring the model to implicitly recover the unmixing process to obtain disentangled representations using only limited supervision, which hinders effective adaptation. Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). This removes the need to learn the unmixing process from the labeled data, thereby reducing the number of trainable parameters and mitigating overfitting. Taking a step further, while ICA can obtain visual disentangled representations, it may also disrupt CLIP's intra- and inter-modal alignment. To counteract this, CCA further leverages CLIP's inherent cross-modal alignment by enhancing it in two ways: unidirectionally, through fine-tuning a CLIP-based text classifier, and bidirectionally, via a cross-attention mechanism that enriches visual and textual representations through mutual interaction. Both unimodal and cross-modal classification outputs can be effectively combined linearly to improve classification accuracy. Extensive experiments on 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts, while maintaining computational efficiency. Code will be available at https://github.com/tianjiao-j/CCA.
English: The proposed Causal CLIP Adapter (CCA) framework enhances few-shot learning by explicitly disentangling CLIP's visual features with unsupervised ICA and reinforcing cross-modal alignment, achieving superior performance and robustness across benchmarks with fewer parameters.
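A minimal NumPy/scikit-learn sketch of the recipe above, assuming frozen CLIP features have already been extracted: FastICA disentangles the visual features, a nearest-prototype classifier acts as the unimodal branch, and its logits are linearly combined with text-similarity logits. The dimensions, classifier heads, and mixing weight are illustrative, not the CCA design.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)

    # Stand-ins for frozen CLIP image features of a 4-class few-shot support set.
    visual_feats = rng.normal(size=(64, 512))
    labels = rng.integers(0, 4, size=64)

    # 1) Unsupervised ICA disentangles the visual features up to a linear map.
    ica = FastICA(n_components=32, random_state=0, max_iter=500)
    z = ica.fit_transform(visual_feats)

    # 2) Unimodal branch: nearest-prototype logits in the disentangled space.
    protos = np.stack([z[labels == c].mean(axis=0) for c in range(4)])
    visual_logits = -np.linalg.norm(z[:, None, :] - protos[None], axis=-1)

    # 3) Cross-modal branch: similarity of raw CLIP features to (placeholder)
    #    class text embeddings in the original CLIP space.
    text_embeds = rng.normal(size=(4, 512))
    text_logits = visual_feats @ text_embeds.T

    # 4) Linear combination of the two branches.
    alpha = 0.5
    logits = alpha * visual_logits + (1 - alpha) * text_logits
    print("support accuracy:", (logits.argmax(axis=1) == labels).mean())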

Authors:Trinh Quoc Nguyen, Oky Dicky Ardiansyah Prima, Katsuyoshi Hotta
Title: CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification
Abstract:
This study introduces a novel framework, "Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re-identification (CORE-ReID)", to address Unsupervised Domain Adaptation (UDA) for Person Re-identification (ReID). The framework utilizes CycleGAN to generate diverse data that harmonizes differences in image characteristics from different camera sources in the pre-training stage. In the fine-tuning stage, based on a pair of teacher-student networks, the framework integrates multi-view features for multi-level clustering to derive diverse pseudo labels. A learnable Ensemble Fusion component that focuses on fine-grained local information within global features is introduced to enhance learning comprehensiveness and avoid ambiguity associated with multiple pseudo-labels. Experimental results on three common UDA benchmarks in Person ReID demonstrate significant performance gains over state-of-the-art approaches. Additional enhancements, such as the Efficient Channel Attention Block and Bidirectional Mean Feature Normalization, mitigate deviation effects, while the adaptive fusion of global and local features in the ResNet-based model further strengthens the framework. The proposed framework ensures clarity in fusion features, avoids ambiguity, and achieves high accuracy in terms of Mean Average Precision, Top-1, Top-5, and Top-10, positioning it as an advanced and effective solution for UDA in Person ReID. Our codes and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID.
English: The CORE-ReID framework introduces a novel unsupervised domain adaptation approach for person re-identification, utilizing CycleGAN-generated data and ensemble fusion with teacher-student networks to achieve superior performance over existing methods through enhanced feature clarity and comprehensive learning.

Authors:Hyebin Cho, Jaehyup Lee
Title: Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation
Abstract:
Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we construct CelebAMat, a new large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
English Summary: FaceMat is a novel framework that addresses occlusion challenges in face filters by predicting high-quality alpha mattes without auxiliary inputs, improving robustness and visual quality in real-time video applications.
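The two losses named above can be written generically as below: a heteroscedastic Gaussian NLL for the teacher's joint matte-and-uncertainty prediction, and a distillation term whose spatial weights come from that uncertainty; the exact weighting scheme is an assumption, not FaceMat's formulation.

    import torch

    def matting_nll(alpha_pred, log_var, alpha_gt):
        # Teacher loss: predict the alpha matte and a per-pixel log-variance jointly.
        return (0.5 * torch.exp(-log_var) * (alpha_pred - alpha_gt) ** 2
                + 0.5 * log_var).mean()

    def uncertainty_guided_distillation(student_alpha, teacher_alpha, log_var):
        # Weight the student's imitation loss toward the teacher's uncertain
        # (typically occluded or ambiguous) regions.
        w = torch.softmax(log_var.flatten(1), dim=1).reshape_as(log_var)
        return (w * (student_alpha - teacher_alpha).abs()).flatten(1).sum(1).mean()

    B, H, W = 2, 64, 64
    teacher_alpha, log_var = torch.rand(B, 1, H, W), torch.randn(B, 1, H, W)
    print(matting_nll(teacher_alpha, log_var, torch.rand(B, 1, H, W)).item())
    print(uncertainty_guided_distillation(torch.rand(B, 1, H, W),
                                          teacher_alpha, log_var).item())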

Authors:Zeyu Zhu, Weijia Wu, Mike Zheng Shou
Title: Multi-human Interactive Talking Dataset
Abstract:
Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
English Summary: The MIT dataset addresses the gap in multi-human talking video generation by providing 12 hours of high-resolution, annotated footage of natural conversations between two to four speakers, accompanied by the CovOG baseline model that integrates pose and audio features to enable realistic interactive video synthesis.

Authors:Zachary Yahn, Selim Furkan Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang, Yichang Xu, Margaret Loper, Ling Liu
Title: Adversarial Attention Perturbations for Large Object Detection Transformers
Abstract:
Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG's attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at https://github.com/zacharyyahn/AFOG.
English: This paper introduces AFOG, an adversarial attack method that effectively targets both transformer-based and CNN-based object detectors by focusing perturbations on vulnerable areas through a learnable attention mechanism, achieving significant performance improvements and stealth.
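A generic attention-weighted, iterative perturbation loop in the spirit of the description above is sketched below; the toy detector, the loss, and the way the attention map is updated are all placeholders rather than AFOG's actual formulation.

    import torch

    def attention_focused_attack(model, images, loss_fn, eps=8 / 255,
                                 step=2 / 255, iters=10):
        # Iteratively inject an L_inf-bounded perturbation whose strength is
        # modulated by a learnable per-pixel attention map.
        delta = torch.zeros_like(images, requires_grad=True)
        attn = torch.zeros(images.shape[0], 1, *images.shape[2:], requires_grad=True)
        attn_opt = torch.optim.Adam([attn], lr=0.05)
        for _ in range(iters):
            adv = (images + torch.sigmoid(attn) * delta).clamp(0, 1)
            loss = loss_fn(model(adv))                   # detector loss to maximize
            g_delta, g_attn = torch.autograd.grad(loss, [delta, attn])
            with torch.no_grad():
                delta += step * g_delta.sign()           # gradient ascent on delta
                delta.clamp_(-eps, eps)
            attn_opt.zero_grad()
            attn.grad = -g_attn                          # ascend on attn via Adam
            attn_opt.step()
        return (images + torch.sigmoid(attn) * delta).clamp(0, 1).detach()

    model = torch.nn.Conv2d(3, 1, 3, padding=1)          # toy stand-in for a detector
    imgs = torch.rand(2, 3, 64, 64)
    adv = attention_focused_attack(model, imgs, loss_fn=lambda out: -out.mean())
    print((adv - imgs).abs().max().item())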

Authors:Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao, Tianpei Gu, Guoxian Song, Xin Chen, Chao Liang, Jianwen Jiang, Linjie Luo
Title: X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio
Abstract:
We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

Authors:Mahnoor Fatima Saad, Ziad Al-Halah
Title: How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes
Abstract:
How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene's key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods.

Authors:Mehrdad Moradi, Kamran Paynabar
Title: RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation
Abstract:
Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state-of-the-art diffusion models for unsupervised anomaly segmentation when only contaminated data is available, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets. The implementation code is available at: https://github.com/mehrdadmoradi124/RDDPM
English: This paper introduces robust denoising diffusion probabilistic models (RDDPM) that effectively perform unsupervised anomaly segmentation using only contaminated unlabeled data, outperforming existing methods by up to 8.08% in AUROC and 10.37% in AUPRC on benchmark datasets.
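Viewed through the regression lens described above, the key substitution is in the noise-prediction loss; below it is swapped for a robust loss (Huber is used purely as one example of a robust regression loss), with a toy network and schedule standing in for a real DDPM.

    import torch
    import torch.nn.functional as F

    class TinyEps(torch.nn.Module):
        # Toy epsilon-prediction network (a real DDPM would use a U-Net).
        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(1, 1, 3, padding=1)

        def forward(self, x, t):
            return self.conv(x)

    def ddpm_training_step(model, x0, alphas_cumprod, robust=True):
        # One denoising training step; the epsilon regression is made robust so
        # anomalous samples in contaminated training data are down-weighted.
        b = x0.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (b,))
        a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        eps_pred = model(x_t, t)
        if robust:
            return F.smooth_l1_loss(eps_pred, noise, beta=1.0)   # Huber loss
        return F.mse_loss(eps_pred, noise)                       # standard DDPM loss

    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
    print(ddpm_training_step(TinyEps(), torch.rand(4, 1, 32, 32), alphas_cumprod).item())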

Authors:Farzad Beizaee, Sina Hajimiri, Ismail Ben Ayed, Gregory Lodygensky, Christian Desrosiers, Jose Dolz
Title: REFLECT: Rectified Flows for Efficient Brain Anomaly Correction Transport
Abstract:
Unsupervised anomaly detection (UAD) in brain imaging is crucial for identifying pathologies without the need for labeled data. However, accurately localizing anomalies remains challenging due to the intricate structure of brain anatomy and the scarcity of abnormal examples. In this work, we introduce REFLECT, a novel framework that leverages rectified flows to establish a direct, linear trajectory for correcting abnormal MR images toward a normal distribution. By learning a straight, one-step correction transport map, our method efficiently corrects brain anomalies and can precisely localize anomalies by detecting discrepancies between anomalous input and corrected counterpart. In contrast to the diffusion-based UAD models, which require iterative stochastic sampling, rectified flows provide a direct transport map, enabling single-step inference. Extensive experiments on popular UAD brain segmentation benchmarks demonstrate that REFLECT significantly outperforms state-of-the-art unsupervised anomaly detection methods. The code is available at https://github.com/farzad-bz/REFLECT.
English Summary: The REFLECT framework introduces a novel approach using rectified flows to enable efficient single-step correction and precise localization of brain anomalies in unsupervised anomaly detection, significantly outperforming existing methods on benchmark tests.
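A compact rectified-flow sketch of the correct-then-compare idea: a velocity network is trained to follow the straight path toward normal-looking images, inference is a single Euler step, and the anomaly map is the per-pixel residual. The toy network, the pairing of training images, and the data are placeholders, not the REFLECT training recipe.

    import torch
    import torch.nn as nn

    class FlowNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.SiLU(),
                                     nn.Conv2d(16, 1, 3, padding=1))

        def forward(self, x, t):
            t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
            return self.net(torch.cat([x, t_map], dim=1))

    def train_step(model, opt, x_src, x_normal):
        # Rectified-flow objective: regress the constant velocity of the straight
        # path from a source image toward a normal image.
        t = torch.rand(x_normal.shape[0]).view(-1, 1, 1, 1)
        x_t = (1 - t) * x_src + t * x_normal
        loss = ((model(x_t, t.flatten()) - (x_normal - x_src)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    def correct_and_localize(model, x):
        # Single Euler step (one-step inference) plus a residual anomaly map.
        with torch.no_grad():
            corrected = x + model(x, torch.zeros(x.shape[0]))
        return corrected, (corrected - x).abs()

    model, x_src, x_norm = FlowNet(), torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    train_step(model, opt, x_src, x_norm)
    corrected, anomaly_map = correct_and_localize(model, x_src)
    print(anomaly_map.shape)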

Authors:Mikołaj Zieliński, Krzysztof Byrski, Tomasz Szczepanik, Przemysław Spurek
Title: GENIE: Gaussian Encoding for Neural Radiance Fields Interactive Editing
Abstract:
Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have recently transformed 3D scene representation and rendering. NeRF achieves high-fidelity novel view synthesis by learning volumetric representations through neural networks, but its implicit encoding makes editing and physical interaction challenging. In contrast, GS represents scenes as explicit collections of Gaussian primitives, enabling real-time rendering, faster training, and more intuitive manipulation. This explicit structure has made GS particularly well-suited for interactive editing and integration with physics-based simulation. In this paper, we introduce GENIE (Gaussian Encoding for Neural Radiance Fields Interactive Editing), a hybrid model that combines the photorealistic rendering quality of NeRF with the editable and structured representation of GS. Instead of using spherical harmonics for appearance modeling, we assign each Gaussian a trainable feature embedding. These embeddings are used to condition a NeRF network based on the k nearest Gaussians to each query point. To make this conditioning efficient, we introduce Ray-Traced Gaussian Proximity Search (RT-GPS), a fast nearest Gaussian search based on a modified ray-tracing pipeline. We also integrate a multi-resolution hash grid to initialize and update Gaussian features. Together, these components enable real-time, locality-aware editing: as Gaussian primitives are repositioned or modified, their interpolated influence is immediately reflected in the rendered output. By combining the strengths of implicit and explicit representations, GENIE supports intuitive scene manipulation, dynamic interaction, and compatibility with physical simulation, bridging the gap between geometry-based editing and neural rendering. The code can be found at https://github.com/MikolajZielinski/genie.
English Summary: GENIE is a hybrid model that merges the photorealistic rendering of Neural Radiance Fields with the editable structure of Gaussian Splatting, enabling real-time interactive scene manipulation through a novel conditioning mechanism and efficient Gaussian proximity search.

Authors:Haoyang Li, Liang Wang, Chao Wang, Siyu Zhou, Jing Jiang, Yan Peng, Guodong Long
Title: Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models
Abstract:
For CLIP-based prompt tuning, introducing more data as additional knowledge to enhance the fine-tuning process has proven to be an effective approach. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in the image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach using only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set, and introduces a novel gating mechanism based on a consensus test, reusing the pre-trained prompt tuning backbone model to spontaneously filter noisy samples, further enhancing the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: https://github.com/JREion/AugPT.
English: AugPT is a self-contained prompt tuning method that uses internal data augmentation and a novel gating mechanism to enhance model performance and generalization without external knowledge.
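The consensus-test gating can be illustrated as follows: every augmented view of an image is scored by the frozen prompt-tuning backbone, and only views that agree with the per-image majority prediction are kept; the agreement threshold and the toy backbone are assumptions.

    import torch

    def consensus_filter(backbone, views, min_agreement=0.6):
        # views: (V, B, C, H, W) augmented copies of a batch. Keep, per image, the
        # views whose predicted class matches the majority vote over all views.
        V, B = views.shape[:2]
        with torch.no_grad():
            logits = backbone(views.flatten(0, 1)).reshape(V, B, -1)
            preds = logits.argmax(dim=-1)                     # (V, B)
            majority = preds.mode(dim=0).values               # (B,)
            keep = preds == majority.unsqueeze(0)             # (V, B)
            agreement = keep.float().mean(dim=0)              # (B,)
        # Drop images whose views disagree too much (likely noisy augmentations).
        return keep & (agreement >= min_agreement).unsqueeze(0)

    backbone = torch.nn.Sequential(torch.nn.Flatten(),
                                   torch.nn.Linear(3 * 32 * 32, 10))
    views = torch.rand(4, 8, 3, 32, 32)       # 4 augmented views of 8 images
    mask = consensus_filter(backbone, views)
    print(mask.shape, mask.float().mean().item())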

Authors:Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Title: MedVLThinker: Simple Baselines for Multimodal Medical Reasoning
Abstract:
Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
English: MedVLThinker introduces an open-source framework for medical reasoning models, using reinforcement learning with verifiable rewards to achieve state-of-the-art performance that rivals proprietary models like GPT-4o.

Authors:Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang
Title: ReMoMask: Retrieval-Augmented Masked Motion Generation
Abstract:
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
中文: ReMoMask通过双向动量建模、语义时空注意力机制和增强引导技术,构建了统一的文本驱动运动生成框架,在标准基准测试中显著提升FID分数,实现了最先进的性能表现。
English: ReMoMask introduces a unified framework that overcomes limitations in text-to-motion generation through bidirectional momentum modeling, semantic attention mechanisms, and enhanced guidance, achieving state-of-the-art performance with significant FID score improvements on benchmark datasets.
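The momentum-queue idea that decouples the negative-sample pool from the batch size follows MoCo-style contrastive learning; a hedged sketch (feature dimension, queue size, and temperature are placeholders, not ReMoMask's exact configuration):

import torch
import torch.nn.functional as F

class MomentumQueue:
    """FIFO queue of momentum-encoder features, so the number of
    negatives is independent of the batch size (MoCo-style sketch)."""
    def __init__(self, dim=512, size=8192):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):                       # keys: (B, dim), L2-normalized
        b = keys.shape[0]
        idx = (self.ptr + torch.arange(b)) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = (self.ptr + b) % self.queue.shape[0]

def contrastive_logits(query, pos_key, queue, temperature=0.07):
    """InfoNCE logits: one positive key plus all queued negatives."""
    l_pos = (query * pos_key).sum(-1, keepdim=True)   # (B, 1)
    l_neg = query @ queue.queue.t()                   # (B, K)
    return torch.cat([l_pos, l_neg], dim=1) / temperature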

Authors:Wei Sun, Linhan Cao, Yuqin Cao, Weixia Zhang, Wen Wen, Kaiwei Zhang, Zijian Chen, Fangfang Lu, Xiongkuo Min, Guangtao Zhai
Title: Engagement Prediction of Short Videos with Large Multimodal Models
Abstract:
The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.
中文: 本研究证明,大型多模态模型(LMMs)能有效预测视频参与度,其中结合音频特征的VideoLLaMA2表现尤为突出,在ICCV VQualA 2025挑战赛中取得了最佳成绩。
English: This study demonstrates that large multimodal models (LMMs), particularly VideoLLaMA2 which incorporates audio features alongside visual and textual data, effectively predict video engagement and achieved top performance in the ICCV VQualA 2025 challenge.

Authors:Sheng Wu, Fei Teng, Hao Shi, Qi Jiang, Kai Luo, Kaiwei Wang, Kailun Yang
Title: QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots
Abstract:
Panoramic cameras, capturing comprehensive 360-degree environmental data, are suitable for quadruped robots in surrounding perception and interaction with complex environments. However, the scarcity of high-quality panoramic training data, caused by inherent kinematic constraints and complex sensor calibration challenges, fundamentally limits the development of robust perception systems tailored to these embodied platforms. To address this issue, we propose QuaDreamer, the first panoramic data generation engine specifically designed for quadruped robots. QuaDreamer focuses on mimicking the motion paradigm of quadruped robots to generate highly controllable, realistic panoramic videos, providing a data source for downstream tasks. Specifically, to effectively capture the unique vertical vibration characteristics exhibited during quadruped locomotion, we introduce Vertical Jitter Encoding (VJE). VJE extracts controllable vertical signals through frequency-domain feature filtering and provides high-quality prompts. To facilitate high-quality panoramic video generation under jitter signal control, we propose a Scene-Object Controller (SOC) that effectively manages object motion and boosts background jitter control through the attention mechanism. To address panoramic distortions in wide-FoV video generation, we propose the Panoramic Enhancer (PE), a dual-stream architecture that synergizes frequency-texture refinement for local detail enhancement with spatial-structure correction for global geometric consistency. We further demonstrate that the generated video sequences can serve as training data for the quadruped robot's panoramic visual perception model, enhancing the performance of multi-object tracking in 360-degree scenes. The source code and model weights will be publicly available at https://github.com/losehu/QuaDreamer.
中文: QuaDreamer是首个专为四足机器人设计的全景数据生成引擎,通过模拟其运动模式生成可控的真实全景视频,并利用垂直抖动编码和全景增强技术解决训练数据匮乏问题。
English: QuaDreamer is a pioneering panoramic data generation engine for quadruped robots that mimics their motion to produce realistic videos, addressing training data scarcity through vertical jitter encoding and panoramic enhancement techniques.

Authors:Yaofeng Cheng, Xinkai Gao, Sen Zhang, Chao Zeng, Fusheng Zha, Lining Sun, Chenguang Yang
Title: Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask
Abstract:
Due to the optical properties, transparent objects often lead depth cameras to generate incomplete or invalid depth data, which in turn reduces the accuracy and reliability of robotic grasping. Existing approaches typically input the RGB-D image directly into the network to output the complete depth, expecting the model to implicitly infer the reliability of depth values. However, while effective in training datasets, such methods often fail to generalize to real-world scenarios, where complex light interactions lead to highly variable distributions of valid and invalid depth data. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and monocular depth estimation. By explicitly distinguishing transparent regions from non-transparent ones, the mask enables the model to concentrate on learning accurate depth estimation in these areas from RGB-D input during training. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, monocular depth estimation provides depth context between the transparent object and its surroundings, enhancing depth prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability. Code and videos are available at https://chengyaofeng.github.io/ReMake.github.io/.

Authors:Jianchao Wang, Peng Zhou, Cen Li, Rong Quan, Jie Qin
Title: Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose Eliminating-Floating-Artifacts Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over the baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. Our implementation is available at https://jcwang-gh.github.io/EFA-GS.

Authors:Junxiao Xue, Xiaozhen Liu, Xuecheng Wu, Fei Yu, Jun Wang
Title: InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition
Abstract:
Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment of the network's focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model's capability to handle variations in lighting and the speaker's orientation. Comprehensive experiments on the LRW and LRW1000 datasets confirm the superiority of InfoSyncNet, achieving new state-of-the-art Top-1 accuracies of 92.0% and 60.7%, respectively. The code is available for download (see comments).
中文摘要:InfoSyncNet是一种创新的非均匀序列建模网络,通过定制化数据增强技术动态调整关注点以解决视觉语音识别难题,在基准数据集上分别实现了92.0%和60.7%的最优准确率。
English Summary: InfoSyncNet is a novel non-uniform sequence modeling network with tailored data augmentation that dynamically adjusts focus to overcome visual speech recognition challenges, achieving state-of-the-art accuracy of 92.0% and 60.7% on benchmark datasets.

Authors:Xiao Wang, Hao Si, Fan Zhang, Xiaoya Zhou, Dengdi Sun, Wanli Lyu, Qingquan Yang, Jin Tang
Title: HGTS-Former: Hierarchical HyperGraph Transformer for Multivariate Time Series Analysis
Abstract:
Multivariate time series analysis has long been one of the key research topics in the field of artificial intelligence. However, analyzing complex time series data remains a challenging and unresolved problem due to its high dimensionality, dynamic nature, and complex interactions among variables. Inspired by the strong structural modeling capability of hypergraphs, this paper proposes a novel hypergraph-based time series transformer backbone network, termed HGTS-Former, to address the multivariate coupling in time series data. Specifically, given the multivariate time series signal, we first normalize and embed each patch into tokens. Then, we adopt multi-head self-attention to enhance the temporal representation of each patch. Hierarchical hypergraphs are constructed to aggregate the temporal patterns within each channel and the fine-grained relations between different variables. After that, we convert hyperedges into node features through the EdgeToNode module and adopt a feed-forward network to further enhance the output features. Extensive experiments conducted on two multivariate time series tasks and eight datasets fully validate the effectiveness of our proposed HGTS-Former. The source code will be released at https://github.com/Event-AHU/Time_Series_Analysis.
中文: 本文提出HGTS-Former这一基于超图的创新Transformer网络,通过构建分层超图来建模时间序列中的多元耦合关系,在多个数据集上的实验验证了其优越性能。
English: This paper introduces HGTS-Former, a novel hypergraph-based transformer network that effectively models multivariate coupling in time series data through hierarchical hypergraph construction and feature enhancement, achieving superior performance across multiple datasets.

Authors:Jialiang Wang, Xiong Zhou, Deming Zhai, Junjun Jiang, Xiangyang Ji, Xianming Liu
Title: $ε$-Softmax: Approximating One-Hot Vectors for Mitigating Label Noise
Abstract:
Noisy labels pose a common challenge for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions to achieve noise tolerance in the presence of label noise, particularly symmetric losses. However, they usually suffer from the underfitting issue due to the overly strict symmetric condition. In this work, we propose a simple yet effective approach for relaxing the symmetric condition, namely $ε$-softmax, which simply modifies the outputs of the softmax layer to approximate one-hot vectors with a controllable error $ε$. Essentially, $ε$-softmax not only acts as an alternative for the softmax layer, but also implicitly plays the crucial role in modifying the loss function. We prove theoretically that $ε$-softmax can achieve noise-tolerant learning with controllable excess risk bound for almost any loss function. Recognizing that $ε$-softmax-enhanced losses may slightly reduce fitting ability on clean datasets, we further incorporate them with one symmetric loss, thereby achieving a better trade-off between robustness and effective learning. Extensive experiments demonstrate the superiority of our method in mitigating synthetic and real-world label noise. The code is available at https://github.com/cswjl/eps-softmax.
中文: 本文提出$ε$-softmax方法,通过放宽对称条件来改进鲁棒损失函数,有效应对深度学习中的标签噪声问题,在理论保证和实验验证下实现了噪声鲁棒性与模型拟合能力的更好平衡。
English: This paper introduces the $ε$-softmax method, which relaxes the symmetric condition in robust loss functions to address noisy labels in deep learning, achieving a better balance between noise tolerance and model fitting through theoretical guarantees and experimental validation.
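One plausible instantiation of the idea, as a sketch: mix the softmax distribution with the one-hot vector of the predicted class so the output stays within a controllable distance ε of a one-hot vector. This mixing form is an assumption for illustration, not necessarily the paper's exact formula.

import torch
import torch.nn.functional as F

def eps_softmax(logits, eps=0.1):
    """Push softmax outputs toward the one-hot vector of the predicted
    class with a controllable error eps (illustrative realization)."""
    p = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(p.argmax(dim=-1), num_classes=p.shape[-1]).to(p.dtype)
    return (1.0 - eps) * one_hot + eps * p   # still a valid distribution

Any downstream loss (e.g., cross-entropy) then sees near-one-hot outputs, which is the route to noise tolerance described in the abstract.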

Authors:Shuo Lu, Yanyin Chen, Wei Feng, Jiahao Fan, Fengheng Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Jian Liang
Title: Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation
Abstract:
Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose Uni-Layout, a novel framework that achieves unified generation, human-mimicking evaluation and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background or element contents constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build Layout-HF100k, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on Layout-HF100k, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that Uni-Layout significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at https://github.com/JD-GenX/Uni-Layout.
中文: Uni-Layout 是一个统一框架,通过自然语言提示实现通用布局生成,并利用大规模人工标注数据集进行拟人化评估,借助动态对齐优化实现了卓越性能。
English: Uni-Layout is a unified framework that integrates universal layout generation via natural language prompts and human-mimicking evaluation using a large-scale annotated dataset, achieving superior performance through dynamic alignment optimization.
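A hedged sketch of a DPO-style preference loss with a dynamic margin, in the spirit of DMPO; the margin schedule, beta, and the use of summed log-probabilities are illustrative assumptions rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def dmpo_loss(logp_chosen, logp_rejected, preference_strength,
              beta=0.1, base_margin=1.0):
    """Preference loss whose margin grows with the strength of the
    (human or evaluator) preference. preference_strength is in [0, 1];
    logp_chosen / logp_rejected are summed log-probs of the preferred
    and dispreferred layouts under the generator."""
    margin = base_margin * preference_strength
    return -F.logsigmoid(beta * (logp_chosen - logp_rejected) - margin).mean()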

Authors:Marian Lupascu, Mihai-Sorin Stupariu
Title: Optimal Transport for Rectified Flow Image Editing: Unifying Inversion-Based and Direct Methods
Abstract:
Image editing in rectified flow models remains challenging due to the fundamental trade-off between reconstruction fidelity and editing flexibility. While inversion-based methods suffer from trajectory deviation, recent inversion-free approaches like FlowEdit offer direct editing pathways but can benefit from additional guidance to improve structure preservation. In this work, we demonstrate that optimal transport theory provides a unified framework for improving both paradigms in rectified flow editing. We introduce a zero-shot transport-guided inversion framework that leverages optimal transport during the reverse diffusion process, and extend optimal transport principles to enhance inversion-free methods through transport-optimized velocity field corrections. Incorporating transport-based guidance can effectively balance reconstruction accuracy and editing controllability across different rectified flow editing approaches. For inversion-based editing, our method achieves high-fidelity reconstruction with LPIPS scores of 0.001 and SSIM of 0.992 on face editing benchmarks, observing 7.8% to 12.9% improvements over RF-Inversion on LSUN datasets. For inversion-free editing with FlowEdit on FLUX and Stable Diffusion 3, we demonstrate consistent improvements in semantic consistency and structure preservation across diverse editing scenarios. Our semantic face editing experiments show an 11.2% improvement in identity preservation and enhanced perceptual quality. The unified optimal transport framework produces visually compelling edits with superior detail preservation across both inversion-based and direct editing paradigms. Code is available for RF-Inversion and FlowEdit at: https://github.com/marianlupascu/OT-RF
中文: 最优传输理论为整流流编辑提供了统一框架,通过传输引导的反转和速度场校正,在保持高保真重建的同时显著提升了编辑可控性与结构保持能力。
English: Optimal transport theory provides a unified framework to enhance both inversion-based and inversion-free rectified flow editing methods, effectively balancing reconstruction fidelity and editing flexibility while achieving superior structure preservation and semantic consistency.

Authors:Xu Wang, Shengeng Tang, Fei Wang, Lechao Cheng, Dan Guo, Feng Xue, Richang Hong
Title: Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering
Abstract:
Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention. This allows for robust generation in both audio-present and audio-free scenarios. Finally, a landmark-guided renderer synthesizes photorealistic facial videos with accurate lip synchronization. Extensive evaluations show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness, establishing a new paradigm for controllable and flexible talking face generation. Our project homepage is https://plyon1.github.io/Text2Lip/.

Authors:Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
Title: Qwen-Image Technical Report
Abstract:
We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
中文: Qwen-Image 是一款突破性的图像生成模型,通过先进的数据处理和渐进式训练,在复杂文本渲染和精确图像编辑方面表现卓越,在多项基准测试中达到了领先水平。
English: Qwen-Image is a groundbreaking image generation model that excels in complex text rendering and precise image editing through advanced data processing and progressive training, achieving state-of-the-art performance across multiple benchmarks.

Authors:Philipp Wulff, Felix Wimbauer, Dominik Muhle, Daniel Cremers
Title: Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
Abstract:
Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.

Authors:Dmitrii Seletkov, Sophie Starck, Ayhan Can Erdur, Yundi Zhang, Daniel Rueckert, Rickmer Braren
Title: Whole-body Representation Learning For Competing Preclinical Disease Risk Assessment
Abstract:
Reliable preclinical disease risk assessment is essential to move public healthcare from reactive treatment to proactive identification and prevention. However, image-based risk prediction algorithms often consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. We propose a whole-body self-supervised representation learning method for preclinical disease risk assessment under a competing risk model. This approach outperforms whole-body radiomics in multiple diseases, including cardiovascular disease (CVD), type 2 diabetes (T2D), chronic obstructive pulmonary disease (COPD), and chronic kidney disease (CKD). Simulating a preclinical screening scenario and subsequently combining it with cardiac MRI further sharpens the prediction for CVD subgroups: ischemic heart disease (IHD), hypertensive diseases (HD), and stroke. The results indicate the translational potential of whole-body representations as a standalone screening modality and as part of a multi-modal framework within clinical workflows for early personalized risk stratification. The code is available at https://github.com/yayapa/WBRLforCR/
Chinese: 本研究提出了一种全身自监督学习方法,在多种疾病预测中优于传统影像组学,展现了其作为独立筛查工具及多模态临床流程一部分,在早期个性化风险评估中的转化潜力。
English: This study introduces a whole-body self-supervised learning method that outperforms traditional radiomics in predicting multiple diseases, demonstrating its potential for early personalized risk screening both independently and in multimodal clinical workflows.

Authors:Jae-Young Kang, Hoonhee Cho, Kuk-Jin Yoon
Title: Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection
Abstract:
3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at https://github.com/mickeykang16/Ev-Stereo3D.
中文: 本文提出了一种仅使用事件相机的新型立体三维物体检测框架,通过双滤波器机制提取语义和几何信息并改进边界框回归,克服了传统传感器在高速场景中的局限性。
English: This paper introduces a novel stereo 3D object detection framework that uses only event cameras, overcoming limitations of conventional sensors in high-speed scenarios through a dual filter mechanism for semantic and geometric information extraction and improved bounding box regression.

Authors:Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Title: Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning
Abstract:
Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
中文: Patho-AgenticRAG是一种多模态检索增强生成框架,通过从权威病理教材中实现文本-图像联合检索来解决病理视觉语言模型的幻觉问题,显著提升了复杂诊断任务的准确性。
English: Patho-AgenticRAG is a multimodal retrieval-augmented generation framework that addresses hallucinations in pathology vision-language models by enabling joint text-image retrieval from authoritative textbooks, significantly improving diagnostic accuracy in complex tasks.
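A minimal sketch of joint text-image page retrieval over page-level embeddings, as described above; the embedding models, the fusion weight alpha, and top_k are assumptions for illustration, not the system's actual configuration.

import numpy as np

def retrieve_pages(query_text_emb, query_image_emb,
                   page_text_embs, page_image_embs, alpha=0.5, top_k=3):
    """Rank textbook pages by a weighted sum of text and image cosine
    similarity, so pages carrying relevant visual cues are not lost."""
    def cos(q, m):
        q = q / np.linalg.norm(q)
        m = m / np.linalg.norm(m, axis=1, keepdims=True)
        return m @ q                           # (num_pages,)
    scores = alpha * cos(query_text_emb, page_text_embs) \
             + (1 - alpha) * cos(query_image_emb, page_image_embs)
    return np.argsort(-scores)[:top_k]         # indices of best pages

The retrieved pages would then be fed back to the agent for reasoning, task decomposition, or another round of search.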

Authors:Yuanbin Fu, Xiaojie Guo
Title: Semi-Supervised Semantic Segmentation via Derivative Label Propagation
Abstract:
Semi-supervised semantic segmentation, which leverages a limited set of labeled images, helps to relieve the heavy annotation burden. While pseudo-labeling strategies yield promising results, there is still room for enhancing the reliability of pseudo-labels. Hence, we develop a semi-supervised framework, namely DerProp, equipped with a novel derivative label propagation to rectify imperfect pseudo-labels. Our label propagation method imposes discrete derivative operations on pixel-wise feature vectors as additional regularization, thereby generating strictly regularized similarity metrics. Doing so effectively alleviates the ill-posed problem that identical similarities correspond to different features, through constraining the solution space. Extensive experiments are conducted to verify the rationality of our design, and demonstrate our superiority over other methods. Codes are available at https://github.com/ForawardStar/DerProp/.
中文摘要:提出的DerProp框架通过引入导数标签传播方法,对像素特征进行离散导数运算来改进伪标签质量,有效约束解空间并在半监督语义分割任务中优于现有方法。
English Summary: The proposed DerProp framework enhances semi-supervised semantic segmentation by introducing derivative label propagation to refine pseudo-labels through discrete derivative operations on pixel features, effectively constraining the solution space and outperforming existing methods.
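A hedged sketch of a derivative-regularized similarity: compare not only the raw pixel-wise feature vectors but also their discrete derivatives, so features that would otherwise yield identical raw similarities can be told apart. The first-order channel-wise differencing and the equal weighting are illustrative assumptions, not the paper's exact operators.

import torch
import torch.nn.functional as F

def derivative_similarity(feat_a, feat_b):
    """Similarity over raw features plus their discrete derivatives.
    feat_a, feat_b: (N, C) pixel-wise feature vectors."""
    sim_raw = F.cosine_similarity(feat_a, feat_b, dim=-1)
    d_a = feat_a[:, 1:] - feat_a[:, :-1]       # discrete derivative
    d_b = feat_b[:, 1:] - feat_b[:, :-1]
    sim_der = F.cosine_similarity(d_a, d_b, dim=-1)
    return 0.5 * (sim_raw + sim_der)           # regularized similarity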

Authors:Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
Title: I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
Abstract:
Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges, including unnecessary incorporation of image data in certain scenarios and the reliance only on a one-time extraction of visual features, which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
Chinese Summary: 本文提出了一种新颖的基于大语言模型的多模态实体链接框架,优先利用文本信息,在必要时通过多轮迭代策略整合关键视觉线索,在三个公开数据集上实现了最先进的性能。
English Summary: This paper introduces a novel LLM-based framework for multimodal entity linking that prioritizes text information and employs a multi-round iterative strategy to integrate key visual clues only when necessary, achieving state-of-the-art performance on three public datasets.

Authors:Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu
Title: Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor
Abstract:
Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The project page is available at https://cg-taylor-acce.github.io/CG-Taylor/.
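A minimal sketch of the confidence-gated Taylor idea described above: cached features are extrapolated by a first-order finite difference, and the first block's extrapolation error decides whether to trust the cheap prediction for the last block or to fall back to full computation. The tolerance value and the relative-error metric are illustrative assumptions.

import torch

def taylor_predict(prev, prev_prev):
    """First-order Taylor / finite-difference extrapolation of a cached
    feature to the next timestep: f(t+1) ≈ f(t) + (f(t) - f(t-1))."""
    return prev + (prev - prev_prev)

def confidence_gated_step(first_block, full_model, x_t, cache, tol=0.05):
    """cache["first"] and cache["last"] each hold the two most recent
    outputs (prev, prev_prev) of the first and last blocks.
    (Cache updates with the new outputs are omitted for brevity.)"""
    pred_first = taylor_predict(*cache["first"])
    true_first = first_block(x_t)
    rel_err = (pred_first - true_first).norm() / (true_first.norm() + 1e-8)
    if rel_err < tol:                          # trust the cheap Taylor path
        out = taylor_predict(*cache["last"])
    else:                                      # fall back to full compute
        out = full_model(x_t)
    return out, rel_err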

Authors:Yanyun Wang, Li Liu
Title: Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training
Abstract:
Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: From the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment to an appropriate location. In response, we define a new AT objective named Robust Perception, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel Robust Perception Adversarial Training (RPAT) method, effectively mitigating the current accuracy-robustness trade-off. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-34-10 demonstrate the effectiveness of our method beyond four common baselines and 12 state-of-the-art (SOTA) works. The code is available at https://github.com/FlaAI/RPAT.
Chinese: 本研究揭示,对抗性训练中清洁准确性与鲁棒性之间的权衡源于对困难对抗样本的过度学习,导致决策边界退化,并提出了一种鲁棒感知对抗训练方法,有效缓解了这一问题。
English: This study identifies that the trade-off between clean accuracy and adversarial robustness in Adversarial Training stems from the over-learning of hard adversarial samples, leading to a degraded decision boundary, and proposes a Robust Perception Adversarial Training method to mitigate this issue effectively.
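A hedged sketch of a smooth-perception objective in the spirit of Robust Perception: the prediction at an intermediate point along the perturbation direction is pulled toward the interpolation of the clean and fully perturbed predictions. This specific instantiation is an assumption for illustration, not the paper's exact loss.

import torch
import torch.nn.functional as F

def robust_perception_loss(model, x, delta, alpha=0.5):
    """Encourage the model's perception to change smoothly along the
    perturbation direction delta (alpha picks the intermediate point)."""
    p_clean = F.softmax(model(x), dim=-1)
    p_adv = F.softmax(model(x + delta), dim=-1)
    log_p_mid = F.log_softmax(model(x + alpha * delta), dim=-1)
    target = alpha * p_adv + (1 - alpha) * p_clean
    return F.kl_div(log_p_mid, target, reduction="batchmean")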

Authors:Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, Mingkui Tan
Title: Test-Time Model Adaptation for Quantized Neural Networks
Abstract:
Quantizing deep models prior to deployment is a widely adopted technique to speed up inference for various real-time applications, such as autonomous driving. However, quantized models often suffer from severe performance degradation in dynamic environments with potential domain shifts, and this degradation is significantly more pronounced compared with their full-precision counterparts, as shown by our theoretical and empirical illustrations. To address the domain shift problem, test-time adaptation (TTA) has emerged as an effective solution by enabling models to learn adaptively from test data. Unfortunately, existing TTA methods are often impractical for quantized models as they typically rely on gradient backpropagation, an operation that is unsupported on quantized models due to vanishing gradients, as well as memory and latency constraints. In this paper, we focus on TTA for quantized models to improve their robustness and generalization ability efficiently. We propose a continual zeroth-order adaptation (ZOA) framework that enables efficient model adaptation using only two forward passes, eliminating the computational burden of existing methods. Moreover, we propose a domain knowledge management scheme to store and reuse different domain knowledge with negligible memory consumption, reducing the interference of different domain knowledge and fostering the knowledge accumulation during long-term adaptation. Experimental results on three classical architectures, including quantized transformer-based and CNN-based models, demonstrate the superiority of our methods for quantized model adaptation. On the quantized W6A6 ViT-B model, our ZOA is able to achieve a 5.0% improvement over the state-of-the-art FOA on the ImageNet-C dataset. The source code is available at https://github.com/DengZeshuai/ZOA.
中文: 量化深度模型在动态环境中性能严重下降,而现有测试时适应方法因梯度反向传播限制难以应用,因此提出了持续零阶适应框架,仅需两次前向传播即可高效适应,并辅以可忽略内存消耗的领域知识管理方案。
English: Quantized deep models face severe performance degradation in dynamic environments, but existing test-time adaptation methods are impractical due to gradient backpropagation constraints, prompting the proposal of a continual zeroth-order adaptation framework that enables efficient adaptation with only two forward passes and a domain knowledge management scheme.
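A minimal sketch of a zeroth-order update that needs only two forward passes, in the spirit of ZOA: the (few) adaptation parameters are perturbed along a random direction and its negation, and the loss difference gives a gradient estimate. The antithetic single-direction estimate and the step sizes are illustrative assumptions.

import torch

@torch.no_grad()
def zeroth_order_step(loss_fn, params, lr=1e-3, mu=1e-3):
    """One backpropagation-free update. loss_fn() runs a forward pass
    on the current test batch and returns a scalar loss tensor."""
    direction = [torch.randn_like(p) for p in params]
    for p, d in zip(params, direction):        # move to +mu * d
        p.add_(mu * d)
    loss_plus = loss_fn()                      # forward pass 1
    for p, d in zip(params, direction):        # move to -mu * d
        p.add_(-2 * mu * d)
    loss_minus = loss_fn()                     # forward pass 2
    grad_scale = (loss_plus - loss_minus) / (2 * mu)
    for p, d in zip(params, direction):        # undo perturbation, then descend
        p.add_(mu * d)
        p.add_(-lr * grad_scale * d)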

Authors:Bufano Michele, Kotter Elmar
Title: Deep classification algorithm for De-identification of DICOM medical images
Abstract:
Background: De-identification of DICOM (Digital Imaging and Communications in Medicine) files is an essential component of medical image research. Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHI) need to be hidden or removed for legal reasons. According to the Health Insurance Portability and Accountability Act (HIPAA) and its privacy rules, full-face photographic images and any comparable images are also direct identifiers and are considered protected health information that must likewise be de-identified. Objective: The study aimed to implement a method that de-identifies the PII and PHI present in the header and burned into the pixel data of DICOM files. Methods: To perform the de-identification, we implemented an algorithm based on the safe harbor method defined by HIPAA. Our algorithm uses customizable input parameters to classify and then, where required, de-identify individual DICOM tags. Results: The most sensitive information, such as names, history, personal data, and institution, was successfully recognized. Conclusions: We developed a Python algorithm that is able to classify the information present in a DICOM file. The flexibility provided by the customizable input parameters, which allow the user to adapt the entire process to the case at hand (e.g., the language), makes the program promising for both everyday use and research purposes. Our code is available at https://github.com/rtdicomexplorer/deep_deidentification.
中文: 本研究基于HIPAA安全港方法开发了一种Python算法,用于对DICOM文件中的个人和健康信息进行去标识化处理,其可定制参数设计使其在研究和日常应用中都具有良好的灵活性。
English: This study developed a Python algorithm based on HIPAA's safe harbor method to de-identify personal and health information in DICOM files, offering customizable parameters for flexible application in both research and daily use.
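A minimal sketch of safe-harbor-style header de-identification, assuming the pydicom library; the keyword list below is illustrative, whereas the paper's algorithm classifies tags from customizable input parameters and also handles text burned into the pixel data.

import pydicom
from pydicom.dataset import Dataset

# Tags treated as identifying; this list is an illustrative subset only.
SENSITIVE_KEYWORDS = {"PatientName", "PatientID", "PatientBirthDate",
                      "InstitutionName", "ReferringPhysicianName"}

def deidentify(ds: Dataset) -> Dataset:
    """Blank every data element whose keyword is classified as PII/PHI
    (header-level de-identification; pixel-level redaction not shown)."""
    for elem in ds:
        if elem.keyword in SENSITIVE_KEYWORDS:
            elem.value = ""
    return ds

# Example on a minimal in-memory dataset
ds = Dataset()
ds.PatientName, ds.Modality = "Doe^John", "CT"
print(deidentify(ds).PatientName)   # empty value after de-identification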

Authors:Lei Yao, Yi Wang, Yi Zhang, Moyun Liu, Lap-Pui Chau
Title: GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting
Abstract:
The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.
Chinese: GaussianCross提出了一种融合3D高斯泼溅的跨模态自监督三维表征学习架构,通过构建三属性自适应蒸馏模块解决模型坍塌与结构信息缺失问题,在ScanNet等基准测试中以极低参数量(<0.1%)和少量数据(1%场景)实现最优性能。
English: GaussianCross introduces a cross-modal self-supervised 3D representation learning architecture that integrates 3D Gaussian Splatting to overcome model collapse and structural deficiencies, achieving superior efficiency and performance on benchmarks like ScanNet through minimal parameter usage and limited data training.

Authors:Tom Fischer, Xiaojie Zhang, Eddy Ilg
Title: Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
Abstract:
Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at https://github.com/Fischer-Tom/unified-detection-and-pose-estimation.
Chinese: 本文提出了一种统一模型,将物体检测与三维姿态估计整合到单一框架中,仅需RGB图像输入,在REAL275数据集上实现了22.9%的性能提升,并展现出更强的鲁棒性。
English: This paper introduces a unified model that combines object detection and 3D pose estimation into a single framework for RGB images, achieving state-of-the-art results with a 22.9% improvement on the REAL275 dataset and demonstrating enhanced robustness.

Authors:Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf
Title: AID4AD: Aerial Image Data for Automated Driving Perception
Abstract:
This work investigates the integration of spatially aligned aerial imagery into perception tasks for automated vehicles (AVs). As a central contribution, we present AID4AD, a publicly available dataset that augments the nuScenes dataset with high-resolution aerial imagery precisely aligned to its local coordinate system. The alignment is performed using SLAM-based point cloud maps provided by nuScenes, establishing a direct link between aerial data and nuScenes local coordinate system. To ensure spatial fidelity, we propose an alignment workflow that corrects for localization and projection distortions. A manual quality control process further refines the dataset by identifying a set of high-quality alignments, which we publish as ground truth to support future research on automated registration. We demonstrate the practical value of AID4AD in two representative tasks: in online map construction, aerial imagery serves as a complementary input that improves the mapping process; in motion prediction, it functions as a structured environmental representation that replaces high-definition maps. Experiments show that aerial imagery leads to a 15-23% improvement in map construction accuracy and a 2% gain in trajectory prediction performance. These results highlight the potential of aerial imagery as a scalable and adaptable source of environmental context in automated vehicle systems, particularly in scenarios where high-definition maps are unavailable, outdated, or costly to maintain. AID4AD, along with evaluation code and pretrained models, is publicly released to foster further research in this direction: https://github.com/DriverlessMobility/AID4AD.
中文: 本研究提出了AID4AD数据集,通过将高分辨率航拍图像与nuScenes数据集精确对齐,显著提升了自动驾驶车辆的地图构建精度15-23%和轨迹预测性能2%,为环境感知提供了可扩展的解决方案。
English: This research introduces AID4AD, a dataset that enhances the nuScenes dataset with aligned aerial imagery, demonstrating its utility in improving automated vehicle tasks such as map construction by 15-23% and motion prediction by 2%.

Authors:Kuo Wang, Quanlong Zheng, Junlin Xie, Yanhao Zhang, Jinguo Luo, Haonan Lu, Liang Lin, Fan Zhou, Guanbin Li
Title: Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference
Abstract:
Video Multimodal Large Language Models (Video-MLLMs) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation in the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance on long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. Differently, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach, Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shallow layers in LLMs, a reference fusion step is derived to compose a final mixed reasoning sequence with key tokens from parallel chunks, which compensates for the cross-reference vision interactions that are neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance under much lower computing costs in reasoning multiplexed context length, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, and LongVideoBench show that Free-MoRef achieves full perception of 2x to 8x longer input frames without compression on a single A100 GPU while keeping instant responses, thereby bringing significant performance gains, even surpassing dedicatedly trained long-video-MLLMs. Codes are available at https://github.com/wkfdb/Free-MoRef
中文摘要:Free-MoRef提出一种无需训练的方法,通过将视觉令牌重构为多参考序列并融合并行线索,有效扩展视频多模态大语言模型对长视频的上下文感知能力,以更低计算成本实现更优性能。
English Summary: Free-MoRef introduces a training-free method that efficiently extends Video-MLLMs' context perception for long videos by reconstructing vision tokens into multi-reference sequences and fusing parallel clues, achieving superior performance with lower computational costs.
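A minimal sketch of the split-and-fuse idea: the long vision-token sequence is reshaped into several shorter reference chunks, each attended to in parallel with the same query, and the resulting activations are fused. Mean fusion here is a simplification of the paper's reference-fusion step, and the chunking scheme is illustrative.

import torch
import torch.nn.functional as F

def split_into_references(vision_tokens, n_refs):
    """Reshape a long vision-token sequence (L, D) into n_refs shorter
    reference chunks, each within the model's context budget."""
    L, D = vision_tokens.shape
    pad = (-L) % n_refs
    if pad:
        vision_tokens = torch.cat([vision_tokens, vision_tokens.new_zeros(pad, D)])
    return vision_tokens.view(n_refs, -1, D)       # (n_refs, L/n_refs, D)

def moref_attention(query, ref_chunks):
    """Gather clues from each reference chunk in parallel, then average
    the activations (a simplified stand-in for reference fusion)."""
    outs = [F.scaled_dot_product_attention(query.unsqueeze(0),
                                           chunk.unsqueeze(0),
                                           chunk.unsqueeze(0)).squeeze(0)
            for chunk in ref_chunks]
    return torch.stack(outs).mean(dim=0)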

Authors:Yihang Huang, Yuanfei Huang, Junhui Lin, Hua Huang
Title: DeflareMamba: Hierarchical Vision Mamba for Contextually Consistent Lens Flare Removal
Abstract:
Lens flare removal remains challenging because information from the optical flares is entangled with the underlying image background, owing to the complex optical interactions between light sources and the camera lens. While recent solutions have shown promise in decoupling the flare corruption from the image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintaining the ability to capture local-global dependencies. In particular, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserve local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMamba.
中文摘要:DeflareMamba通过采用状态空间模型的分层框架,在有效消除镜头光晕伪影的同时保持图像上下文与细节,显著提升了视觉质量与下游任务性能。
English Summary: DeflareMamba introduces a novel hierarchical framework using state space models to effectively remove lens flare artifacts while preserving image context and details, improving both visual quality and downstream tasks.

Authors:Yuanfei Huang, Hua Huang
Title: Tackling Ill-posedness of Reversible Image Conversion with Well-posed Invertible Network
Abstract:
Reversible image conversion (RIC) suffers from ill-posedness issues due to its forward conversion process being considered an underdetermined system. Despite employing invertible neural networks (INN), existing RIC methods intrinsically remain ill-posed as they inevitably introduce uncertainty by incorporating randomly sampled variables. To tackle the ill-posedness dilemma, we focus on developing a reliable approximate left inverse for the underdetermined system by constructing an overdetermined system with a non-zero Gram determinant, thus ensuring a well-posed solution. Based on this principle, we propose a well-posed invertible 1x1 convolution (WIC), which eliminates the reliance on random variable sampling and enables the development of well-posed invertible networks. Furthermore, we design two innovative networks, WIN-Naïve and WIN, with the latter incorporating advanced skip-connections to enhance long-term memory. Our methods are evaluated across diverse RIC tasks, including reversible image hiding, image rescaling, and image decolorization, consistently achieving state-of-the-art performance. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to overcome the bottlenecks of existing RIC solutions and setting a new benchmark in the field. Codes are available at https://github.com/BNU-ERC-ITEA/WIN.
Chinese: 本研究通过提出一种不依赖随机变量的适定可逆卷积方法,解决了可逆图像转换中的不适定问题,在多项任务中无需随机采样即实现了最先进的性能。
English: The study addresses the ill-posedness in reversible image conversion by introducing a well-posed invertible convolution method that eliminates reliance on random variables, leading to state-of-the-art performance in various tasks without using random sampling.

Authors:Hongzhao Chen, Hexiao Ding, Yufeng Jiang, Jing Lan, Ka Chun Li, Gerald W. Y. Cheng, Sam Ng, Chi Lai Ho, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Title: REACT-KD: Region-Aware Cross-modal Topological Knowledge Distillation for Interpretable Medical Image Classification
Abstract:
Reliable and interpretable tumor classification from clinical imaging remains a core challenge due to heterogeneous modality quality, limited annotations, and the lack of structured anatomical guidance. We introduce REACT-KD, a Region-Aware Cross-modal Topological Knowledge Distillation framework that transfers rich supervision from high-fidelity multi-modal sources into a lightweight CT-based student model. The framework uses a dual teacher design: one branch captures structure-function relationships using dual-tracer PET/CT, and the other models dose-aware features through synthetically degraded low-dose CT data. These branches jointly guide the student model through two complementary objectives. The first focuses on semantic alignment via logits distillation, while the second models anatomical topology using region graph distillation. A shared CBAM-3D module is employed to maintain consistent attention across modalities. To improve reliability for deployment, REACT-KD introduces modality dropout during training, allowing inference under partial or noisy inputs. The staging task for hepatocellular carcinoma (HCC) is conducted as a case study. REACT-KD achieves an average AUC of 93.4% on an internal PET/CT cohort and maintains 76.6% to 81.5% AUC across varying dose levels in external CT testing. Decision curve analysis shows that REACT-KD consistently provides the highest clinical benefit across decision thresholds, supporting its potential in real-world diagnostics. Code is available at https://github.com/Kinetics-JOJO/REACT-KD.
中文摘要:REACT-KD框架通过双教师区域感知知识蒸馏,将多模态影像监督知识迁移至轻量CT模型,在肝癌分期任务中实现了优异的诊断性能和临床实用性。
English Summary: REACT-KD is a novel framework that enhances CT-based tumor classification by transferring knowledge from multi-modal imaging through dual teacher guidance and region-aware distillation, achieving high diagnostic accuracy and clinical reliability.
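A hedged sketch of the dual-teacher logits-distillation term plus the supervised task loss; the temperature and weights are illustrative, and the region-graph distillation and shared CBAM-3D attention of REACT-KD are omitted here.

import torch
import torch.nn.functional as F

def dual_teacher_kd_loss(student_logits, pet_teacher_logits, ct_teacher_logits,
                         labels, T=2.0, alpha=0.5, w_task=1.0):
    """Distill a lightweight CT student from a PET/CT teacher and a
    low-dose-CT teacher via softened logits, plus cross-entropy on labels."""
    def kd(teacher_logits):
        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * T * T
    loss_kd = alpha * kd(pet_teacher_logits) + (1 - alpha) * kd(ct_teacher_logits)
    return loss_kd + w_task * F.cross_entropy(student_logits, labels)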

Authors:Haoxin Yang, Weihong Chen, Xuemiao Xu, Cheng Xu, Peng Xiao, Cuifeng Sun, Shaoyu Huang, Shengfeng He
Title: StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion
Abstract:
Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at https://github.com/wileychan/StarPose.
English: StarPose introduces an autoregressive diffusion framework that integrates historical pose predictions with spatial-temporal physical guidance, significantly improving the accuracy and temporal consistency of 3D human pose estimation over existing methods.

Authors:Sparsh Garg, Abhishek Aich
Title: Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations
Abstract:
Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present a new validation set for traffic signs derived from the Mapillary dataset called Mapillary Vistas Validation for Traffic Signs (MVV), where we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition but also on heavily represented categories like vehicles and humans. Our analysis reveals significant limitations in current vision-language models for fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems. Code and data are available at: https://github.com/nec-labs-ma/relabeling.
English: This study introduces the Mapillary Vistas Validation for Traffic Signs (MVV) dataset, which provides fine-grained annotations to address the coarse labels of existing autonomous driving datasets, and shows that the self-supervised DINOv2 model outperforms vision-language models on fine-grained recognition tasks.

Authors:Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, Basura Fernando
Title: IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A
Abstract:
Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: https://github.com/LUNAProject22/IMoRe.
English: The IMoRe framework removes the need for manually designed modules by combining implicit program-guided reasoning with a dynamic reading mechanism, achieving state-of-the-art performance across multiple motion Q&A datasets.

Authors:Yaroslav Prytula, Illia Tsiporenko, Ali Zeynalli, Dmytro Fishman
Title: IAUNet: Instance-Aware U-Net
Abstract:
Instance segmentation is critical in biomedical imaging to accurately distinguish individual objects like cells, which often overlap and vary in size. Recent query-based methods, where object queries guide segmentation, have shown strong performance. While U-Net has been a go-to architecture in medical image segmentation, its potential in query-based approaches remains largely unexplored. In this work, we present IAUNet, a novel query-based U-Net architecture. The core design features a full U-Net architecture, enhanced by a novel lightweight convolutional Pixel decoder, making the model more efficient and reducing the number of parameters. Additionally, we propose a Transformer decoder that refines object-specific features across multiple scales. Finally, we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource with detailed annotations of overlapping cell cytoplasm in brightfield images, setting a new benchmark for biomedical instance segmentation. Experiments on multiple public datasets and our own show that IAUNet outperforms most state-of-the-art fully convolutional, transformer-based, and query-based models, as well as cell segmentation-specific models, setting a strong baseline for cell instance segmentation tasks. Code is available at https://github.com/SlavkoPrytula/IAUNet.
English Summary: IAUNet is a query-based U-Net architecture that pairs a lightweight convolutional Pixel decoder with a multi-scale Transformer decoder, achieving state-of-the-art biomedical instance segmentation on multiple datasets, including the newly introduced 2025 Revvity Full Cell Segmentation Dataset.

Authors:Weiqi Yan, Chenlu Lin, Youbiao Wang, Zhipeng Cai, Xiuhong Lin, Yangyang Shi, Weiquan Liu, Yu Zang
Title: OmniEvent: Unified Event Representation Learning
Abstract:
Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks heavily rely on task-specific designs due to the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, where local feature aggregation and enhancement are performed independently on the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architecture change. With a unified framework and similar hyper-parameters, OmniEvent outperforms (task-specific) SOTA by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1). Code will be available at https://github.com/Wickyan/OmniEvent.
English: OmniEvent is a unified event representation learning framework that removes the need for task-specific designs by processing spatial and temporal features independently before fusing them, achieving state-of-the-art performance across multiple tasks and datasets.
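The decouple-enhance-fuse idea above lends itself to a short sketch: enhance per-domain features independently, then let attention learn the spatial-temporal interactions. The module sizes, MLP enhancers, and single cross-attention below are illustrative assumptions, not the OmniEvent implementation.

```python
# Sketch of decouple-enhance-fuse: per-domain enhancement, then attention fusion.
# Shapes and module sizes are illustrative, not the OmniEvent release.
import torch
import torch.nn as nn

class DecoupleEnhanceFuse(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.temporal_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats_spatial, feats_temporal):
        # feats_*: (batch, tokens, dim), aggregated independently per domain
        s = self.spatial_mlp(feats_spatial)    # enhance spatial domain only
        t = self.temporal_mlp(feats_temporal)  # enhance temporal domain only
        fused, _ = self.fuse(query=s, key=t, value=t)  # learn S-T interactions
        return fused + s  # residual keeps spatial structure for the grid output

x_s, x_t = torch.randn(2, 128, 64), torch.randn(2, 128, 64)
out = DecoupleEnhanceFuse()(x_s, x_t)  # (2, 128, 64)
```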

Authors:Toufiq Musah
Title: Large Kernel MedNeXt for Breast Tumor Segmentation and Self-Normalizing Network for pCR Classification in Magnetic Resonance Images
Abstract:
Accurate breast tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is important for downstream tasks such as pathological complete response (pCR) assessment. In this work, we address both segmentation and pCR classification using the large-scale MAMA-MIA DCE-MRI dataset. We employ a large-kernel MedNeXt architecture with a two-stage training strategy that expands the receptive field from 3x3x3 to 5x5x5 kernels using the UpKern algorithm. This approach allows stable transfer of learned features to larger kernels, improving segmentation performance on the unseen validation set. An ensemble of large-kernel models achieved a Dice score of 0.67 and a normalized Hausdorff Distance (NormHD) of 0.24. For pCR classification, we trained a self-normalizing network (SNN) on radiomic features extracted from the predicted segmentations and first post-contrast DCE-MRI, reaching an average balanced accuracy of 57%, and up to 75% in some subgroups. Our findings highlight the benefits of combining larger receptive fields and radiomics-driven classification while motivating future work on advanced ensembling and the integration of clinical variables to further improve performance and generalization. Code: https://github.com/toufiqmusah/caladan-mama-mia.git
English: This study uses a two-stage training strategy that progressively enlarges MedNeXt kernels to improve breast tumor segmentation in DCE-MRI, raising Dice scores and enabling radiomics-based pathological response classification with moderate accuracy.
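The UpKern transfer step described above can be sketched compactly: resize the trained 3x3x3 kernels to 5x5x5 by trilinear interpolation and use them to initialize the large-kernel network before stage-two training. This sketch assumes matching channel counts between the two stages and illustrates the core idea only, not the authors' code.

```python
# Sketch of UpKern-style initialization: trilinearly resize small-kernel conv
# weights to seed a large-kernel conv, so learned features transfer stably.
import torch
import torch.nn as nn
import torch.nn.functional as F

def upkern_init(small_conv: nn.Conv3d, large_conv: nn.Conv3d):
    w = small_conv.weight.data              # (out_c, in_c, 3, 3, 3)
    target = large_conv.weight.shape[2:]    # e.g. (5, 5, 5)
    # F.interpolate treats dims 0/1 as batch/channel, so only kernel dims resize.
    w_up = F.interpolate(w, size=target, mode="trilinear", align_corners=True)
    large_conv.weight.data.copy_(w_up)
    if small_conv.bias is not None and large_conv.bias is not None:
        large_conv.bias.data.copy_(small_conv.bias.data)

small = nn.Conv3d(32, 32, kernel_size=3, padding=1)
large = nn.Conv3d(32, 32, kernel_size=5, padding=2)
upkern_init(small, large)  # stage-two training then starts from these weights
```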

Authors:Atom Scott, Ikuma Uchida, Kento Kuroda, Yufi Kim, Keisuke Fujii
Title: SoccerTrack v2: A Full-Pitch Multi-View Soccer Dataset for Game State Reconstruction
Abstract:
SoccerTrack v2 is a new public dataset for advancing multi-object tracking (MOT), game state reconstruction (GSR), and ball action spotting (BAS) in soccer analytics. Unlike prior datasets that use broadcast views or limited scenarios, SoccerTrack v2 provides 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels (2D pitch coordinates, jersey-based player IDs, roles, teams) and BAS labels for 12 action classes (e.g., Pass, Drive, Shot). This technical report outlines the dataset's structure, collection pipeline, and annotation process. SoccerTrack v2 is designed to advance research in computer vision and soccer analytics, enabling new benchmarks and practical applications in tactical analysis and automated tools.
English: SoccerTrack v2 is a public dataset of 10 full-length, 4K panoramic soccer match videos with detailed game state reconstruction and ball action annotations, designed to advance research in computer vision and soccer analytics.

Authors:Xiaotong Zhang, Alexander Broersen, Gonnie CM van Erp, Silvia L. Pintea, Jouke Dijkstra
Title: Skip priors and add graph-based anatomical information, for point-based Couinaud segmentation
Abstract:
The preoperative planning of liver surgery relies on Couinaud segmentation from computed tomography (CT) images, to reduce the risk of bleeding and guide the resection procedure. Using 3D point-based representations, rather than voxelizing the CT volume, has the benefit of preserving the physical resolution of the CT. However, point-based representations need prior knowledge of the liver vessel structure, which is time-consuming to acquire. Here, we propose a point-based method for Couinaud segmentation, without explicitly providing the prior liver vessel structure. To allow the model to learn this anatomical liver vessel structure, we add a graph reasoning module on top of the point features. This adds implicit anatomical information to the model, by learning affinities across point neighborhoods. Our method is competitive on the MSD and LiTS public datasets in Dice coefficient and average surface distance scores compared to four pioneering point-based methods. Our code is available at https://github.com/ZhangXiaotong015/GrPn.
English Summary: This study introduces a point-based method for Couinaud segmentation of liver CT that removes the need for explicit vessel-structure priors by adding a graph reasoning module that learns anatomical affinities from point features, achieving competitive performance on public datasets.
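A minimal sketch of graph reasoning over point features, in the spirit of the module described above: build a kNN graph from point coordinates and aggregate neighbor features with learned affinities. The single-head attention and layer sizes are illustrative choices, not the released GrPn code.

```python
# Sketch: kNN graph attention over point features to learn local affinities.
# Sizes and the single-head design are assumptions, not the GrPn release.
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    def __init__(self, dim=64, k=16):
        super().__init__()
        self.k = k
        self.q, self.kproj, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, xyz, feats):
        # xyz: (N, 3) point coordinates; feats: (N, dim) per-point features
        dists = torch.cdist(xyz, xyz)                      # (N, N) pairwise distances
        idx = dists.topk(self.k, largest=False).indices    # (N, k) nearest points
        nbr = feats[idx]                                   # (N, k, dim) neighbor feats
        attn = (self.q(feats).unsqueeze(1) * self.kproj(nbr)).sum(-1)
        attn = attn.softmax(dim=-1)                        # affinities over neighbors
        out = (attn.unsqueeze(-1) * self.v(nbr)).sum(1)    # weighted aggregation
        return feats + out                                 # residual refinement

pts, f = torch.randn(1024, 3), torch.randn(1024, 64)
refined = GraphReasoning()(pts, f)  # (1024, 64)
```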

Authors:Zhigang Sun, Yiru Wang, Anqing Jiang, Shuo Wang, Yu Gao, Yuwen Heng, Shouyi Zhang, An He, Hao Jiang, Jinhao Chai, Zichong Gu, Wang Jijun, Shichen Tang, Lavdim Halilaj, Juergen Luettin, Hao Sun
Title: DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion
Abstract:
Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion, a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online-HD-map-informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at https://github.com/SunZhigang7/DiffSemanticFusion.
English Summary: The DiffSemanticFusion framework fuses raster and graph representations and uses a map diffusion module to improve the stability and expressiveness of online HD maps, achieving significant gains in trajectory prediction and end-to-end autonomous driving on benchmark datasets.

Authors:Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang
Title: VPN: Visual Prompt Navigation
Abstract:
While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is friendlier for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
English Summary: The paper introduces Visual Prompt Navigation (VPN), a paradigm that guides embodied agents with visual prompts marked on 2D top-view maps instead of language instructions, reducing interpretive ambiguity and improving accessibility for non-expert users.

Authors:Yuxiang Zhang, Wei Li, Mengmeng Zhang, Jiawei Han, Ran Tao, Shunlin Liang
Title: SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models
Abstract:
Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained on massive optical imagery, multispectral/hyperspectral data still lack corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as backbones while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The code will be available at: https://github.com/YuxiangZhang-BIT.
English Summary: The paper introduces SpectralX, a parameter-efficient fine-tuning framework that adapts existing remote sensing foundation models to diverse spectral imagery through a two-stage training approach (masked reconstruction, then semantic segmentation), improving domain generalization without extensive spectral pretraining.
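A rough sketch of an attribute-oriented mixture of adapters in the spirit of AoMoA: a router soft-gates several small bottleneck adapters per token on top of a frozen backbone layer. Expert count, bottleneck sizes, and the softmax router are assumptions, not the SpectralX design.

```python
# Sketch: mixture-of-adapters with a per-token softmax router over a frozen
# backbone layer. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim=768, bottleneck=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)

    def forward(self, tokens):  # tokens: (batch, seq, dim) from a frozen RSFM layer
        gates = self.router(tokens).softmax(dim=-1)                          # (B, S, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, S, D, E)
        delta = (expert_out * gates.unsqueeze(2)).sum(-1)                    # gated mix
        return tokens + delta  # residual: backbone stays frozen, adapters learn

x = torch.randn(2, 196, 768)
y = MixtureOfAdapters()(x)  # (2, 196, 768)
```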

Authors:Han Wang, Zhuoran Wang, Roy Ka-Wei Lee
Title: HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
Abstract:
Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or with one of five Offensive categories (Hateful, Insulting, Sexual, Violence, Self-Harm), along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff's alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.
English Summary: The HateClipSeg dataset addresses video hate speech detection by providing fine-grained multimodal annotations across 11,714 segments; its benchmark tasks reveal significant performance gaps in current models and underscore the need for advanced, temporally aware multimodal approaches.

Authors:Bowen Yang, Yun Cao, Chen He, Xiaosu Su
Title: GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval
Abstract:
Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly addresses this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other: fusion reduces modality gaps, while perturbation regularizes cross-modal matching, yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at https://github.com/YangBowenn/GAID.
English Summary: The GAID framework enhances text-to-video retrieval by adaptively fusing audio-visual features under textual guidance and injecting structured perturbations into text embeddings, achieving state-of-the-art results across multiple benchmarks with notable efficiency gains.
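A minimal sketch of frame-level gated fusion under textual guidance, assuming pre-extracted per-frame audio and visual features and a sentence embedding; the concatenation-based gate is one plausible reading of FGF, not GAID's exact design.

```python
# Sketch: per-frame gate conditioned on the text query decides how much audio
# versus visual evidence to keep at each timestep. Illustrative, not GAID's code.
import torch
import torch.nn as nn

class FrameGatedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())

    def forward(self, visual, audio, text):
        # visual, audio: (batch, frames, dim); text: (batch, dim) sentence embedding
        t = text.unsqueeze(1).expand_as(visual)
        g = self.gate(torch.cat([visual, audio, t], dim=-1))  # (B, F, dim) in [0, 1]
        return g * visual + (1.0 - g) * audio  # text-guided per-frame mixing

v, a = torch.randn(2, 12, 512), torch.randn(2, 12, 512)
q = torch.randn(2, 512)
fused = FrameGatedFusion()(v, a, q)  # (2, 12, 512)
```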

Authors:Luqi Cheng, Zhangshuo Qi, Zijie Zhou, Chao Lu, Guangming Xiong
Title: LT-Gaussian: Long-Term Map Update Using 3D Gaussian Splatting for Autonomous Driving
Abstract:
Maps play an important role in autonomous driving systems. The recently proposed 3D Gaussian Splatting (3D-GS) produces rendering-quality explicit scene reconstruction results, demonstrating the potential for map construction in autonomous driving scenarios. However, because of the time and computational costs involved in generating Gaussian scenes, how to update the map becomes a significant challenge. In this paper, we propose LT-Gaussian, a map update method for 3D-GS-based maps. LT-Gaussian consists of three main components: Multimodal Gaussian Splatting, Structural Change Detection Module, and Gaussian-Map Update Module. Firstly, the Gaussian map of the old scene is generated using our proposed Multimodal Gaussian Splatting. Subsequently, during the map update process, we compare the outdated Gaussian map with the current LiDAR data stream to identify structural changes. Finally, we perform targeted updates to the Gaussian-map to generate an up-to-date map. We establish a benchmark for map updating on the nuScenes dataset to quantitatively evaluate our method. The experimental results show that LT-Gaussian can effectively and efficiently update the Gaussian-map, handling common environmental changes in autonomous driving scenarios. Furthermore, by taking full advantage of information from both new and old scenes, LT-Gaussian is able to produce higher quality reconstruction results compared to map update strategies that reconstruct maps from scratch. Our open-source code is available at https://github.com/ChengLuqi/LT-gaussian.
English: LT-Gaussian efficiently updates 3D Gaussian Splatting maps for autonomous driving by detecting structural changes against the current LiDAR stream and selectively refreshing the affected components, outperforming reconstruction-from-scratch strategies in both quality and efficiency.

Authors:Zhixiang Wei, Xiaoxiao Ma, Ruishen Yan, Tao Tu, Huaian Chen, Jinjin Zheng, Yi Jin, Enhong Chen
Title: Rein++: Efficient Generalization and Adaptation for Semantic Segmentation with Vision Foundation Models
Abstract:
Vision Foundation Models (VFMs) have achieved remarkable success in various computer vision tasks. However, their application to semantic segmentation is hindered by two significant challenges: (1) the disparity in data scale, as segmentation datasets are typically much smaller than those used for VFM pre-training, and (2) domain distribution shifts, where real-world segmentation scenarios are diverse and often underrepresented during pre-training. To overcome these limitations, we present Rein++, an efficient VFM-based segmentation framework that demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios. Specifically, Rein++ comprises a domain generalization solution, Rein-G, and a domain adaptation solution, Rein-A. Rein-G introduces a set of trainable, instance-aware tokens that effectively refine the VFM's features for the segmentation task. This parameter-efficient approach fine-tunes less than 1% of the backbone's parameters, enabling robust generalization. Building on Rein-G, Rein-A performs unsupervised domain adaptation at both the instance and logit levels to mitigate domain shifts. In addition, it incorporates a semantic transfer module that leverages the class-agnostic capabilities of the segment anything model to enhance boundary details in the target domain. The integrated Rein++ pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and subsequently adapts it to diverse target domains (e.g., nighttime scenes) without any target labels. Comprehensive experiments demonstrate that Rein++ significantly outperforms state-of-the-art methods with efficient training, underscoring its role as an efficient, generalizable, and adaptive segmentation solution for VFMs, even for large models with billions of parameters. The code is available at https://github.com/wloves/Rein.
English: The Rein++ framework tackles the data-scale and domain-shift challenges of applying vision foundation models to semantic segmentation through parameter-efficient domain generalization and unsupervised domain adaptation, achieving state-of-the-art performance with minimal training.
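The token-based refinement in Rein-G can be sketched in simplified form: a small set of trainable tokens attends to frozen patch features and writes back a residual correction, so only the tokens and two projections train. Token count, attention layout, and the residual projection below are assumptions, not the released implementation.

```python
# Sketch: refine frozen VFM features with trainable tokens (Rein-G-style idea).
# Only the tokens, attention, and projection would be trained; sizes are assumed.
import torch
import torch.nn as nn

class TokenRefiner(nn.Module):
    def __init__(self, dim=1024, n_tokens=64):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats):  # feats: (batch, patches, dim) from a frozen layer
        t = self.tokens.unsqueeze(0).expand(feats.size(0), -1, -1)
        # Patch features query the task tokens for segmentation-relevant cues.
        delta, _ = self.attn(query=feats, key=t, value=t)
        return feats + self.proj(delta)  # residual keeps the frozen backbone intact

frozen_feats = torch.randn(2, 1024, 1024)
refined = TokenRefiner()(frozen_feats)  # (2, 1024, 1024)
```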

Authors:Na Zhang, Moran Li, Chengming Xu, Han Feng, Xiaobin Hu, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yanwei Fu
Title: StrandDesigner: Towards Practical Strand Generation with Sketch Guidance
Abstract:
Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at https://github.com/fighting-Zhang/StrandDesigner.
English: We introduce the first sketch-based hair strand generation model, which combines a learnable strand upsampling strategy with multi-scale adaptive conditioning to offer finer control and greater realism and precision than existing methods while remaining user-friendly.

Authors:Peiyuan Jiang, Yao Liu, Qiao Liu, Zongshun Zhang, Jiaye Yang, Lu Liu, Daibing Yao
Title: DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition
Abstract:
Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.
English: The Decoupled Representations with Knowledge Fusion (DRKF) method addresses modality heterogeneity and emotional inconsistency in multimodal emotion recognition by decoupling shared and modality-specific features through contrastive learning and fusing emotional cues via a fusion encoder with an emotion discrimination submodule, achieving state-of-the-art performance on benchmark datasets.
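The contrastive mutual-information estimation in ORL can be illustrated with an InfoNCE-style objective that pulls the shared codes of the same sample together across modalities; treating same-index pairs as positives is a simplification of the paper's scheme, and the temperature is an assumed default.

```python
# Sketch: InfoNCE-style lower bound aligning shared codes across two modalities.
# Same-index pairing as positives is an illustrative simplification of ORL.
import torch
import torch.nn.functional as F

def infonce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (batch, dim) shared codes of the same samples in two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0))            # matching index = positive pair
    # Symmetric loss: each modality retrieves its counterpart.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

audio_shared, text_shared = torch.randn(32, 128), torch.randn(32, 128)
loss = infonce(audio_shared, text_shared)
```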

Authors:Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang
Title: LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
Abstract:
Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.
English: LLaDA-MedV is the first large language diffusion model tailored for biomedical image understanding, setting new state-of-the-art accuracy on visual question answering benchmarks and generating longer, more informative responses than existing models.

Authors:Kai Han, Chongwen Lyu, Lele Ma, Chengxuan Qian, Siqi Ma, Zheng Pang, Jun Chen, Zhe Liu
Title: CLIMD: A Curriculum Learning Framework for Imbalanced Multimodal Diagnosis
Abstract:
Clinicians usually combine information from multiple sources to achieve the most accurate diagnosis, and this has sparked increasing interest in leveraging multimodal deep learning for diagnosis. However, in real clinical scenarios, due to differences in incidence rates, multimodal medical data commonly face the issue of class imbalance, which makes it difficult to adequately learn the features of minority classes. Most existing methods tackle this issue with resampling or loss reweighting, but they are prone to overfitting or underfitting and fail to capture cross-modal interactions. Therefore, we propose a Curriculum Learning framework for Imbalanced Multimodal Diagnosis (CLIMD). Specifically, we first design a multimodal curriculum measurer that combines two indicators, intra-modal confidence and inter-modal complementarity, to enable the model to focus on key samples and gradually adapt to complex category distributions. Additionally, a class distribution-guided training scheduler is introduced, which enables the model to progressively adapt to the imbalanced class distribution during training. Extensive experiments on multiple multimodal medical datasets demonstrate that the proposed method outperforms state-of-the-art approaches across various metrics and excels in handling imbalanced multimodal medical data. Furthermore, as a plug-and-play CL framework, CLIMD can be easily integrated into other models, offering a promising path for improving multimodal disease diagnosis accuracy. Code is publicly available at https://github.com/KHan-UJS/CLIMD.
English: The CLIMD framework addresses class imbalance in multimodal medical diagnosis with a curriculum that combines intra-modal confidence and inter-modal complementarity to focus on key samples and gradually adapt to skewed class distributions, outperforming existing methods across multiple datasets.
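A minimal sketch of how a curriculum measurer and a distribution-guided scheduler might interact: score samples by intra-modal confidence and inter-modal complementarity, then admit a growing fraction of the easiest samples per epoch. The exact scoring mix and pacing function are assumptions, not the CLIMD formulas.

```python
# Sketch: curriculum scoring plus a pacing schedule that widens the sample pool.
# The scoring mix (lam) and linear pacing are illustrative assumptions.
import torch

def curriculum_scores(probs_a, probs_b, lam=0.5):
    # probs_*: (batch, classes) per-modality softmax outputs for the same samples
    conf = 0.5 * (probs_a.max(dim=1).values + probs_b.max(dim=1).values)
    # Modalities that disagree strongly make a sample harder; treat the
    # disagreement magnitude as a difficulty penalty.
    disagreement = (probs_a - probs_b).abs().sum(dim=1)
    return lam * conf - (1 - lam) * disagreement  # higher = easier

def select_batch(scores, epoch, total_epochs, min_frac=0.3):
    frac = min(1.0, min_frac + (1 - min_frac) * epoch / total_epochs)
    k = max(1, int(frac * scores.numel()))
    return scores.topk(k).indices  # indices of samples admitted this epoch

pa, pb = torch.rand(64, 5).softmax(-1), torch.rand(64, 5).softmax(-1)
idx = select_batch(curriculum_scores(pa, pb), epoch=2, total_epochs=20)
```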

Authors:Kun Ding, Ying Wang, Shiming Xiang
Title: EvoVLMA: Evolutionary Vision-Language Model Adaptation
Abstract:
Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and expertise. Inspired by recent advances in Large Language Models (LLMs) based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for training-free, efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. Besides, to enhance the stability and efficiency of the search process, we propose low-precision code conversion, web-based code execution, and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually-designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: https://github.com/kding1225/EvoVLMA
English: EvoVLMA uses an LLM-assisted evolutionary algorithm to automatically search for training-free, efficient adaptation algorithms for vision-language models, outperforming manually designed methods on tasks such as few-shot image classification.

Authors:Chengming Wang, Guodong Fan, Jinjiang Li, Min Gan, C. L. Philip Chen
Title: MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection
Abstract:
With the advancement of remote sensing satellite technology and the rapid progress of deep learning, remote sensing change detection (RSCD) has become a key technique for regional monitoring. Traditional change detection (CD) methods and deep learning-based approaches have made significant contributions to change analysis and detection; however, many outstanding methods still face limitations in the exploration and application of multimodal data. To address this, we propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to further explore the semantic interaction capabilities of multimodal data. Multimodal large language models (MLLMs) have attracted widespread attention for their outstanding performance in computer vision, particularly due to their powerful visual-language understanding and dialogic interaction capabilities. Specifically, we design an MLLM-based optimization strategy to generate multimodal textual data from the original CD images, which serve as textual input to MGCR. Visual and textual features are extracted through a dual encoder framework. For the first time in the RSCD task, we introduce a multimodal graph-conditioned vision-language reconstruction mechanism, which is integrated with graph attention to construct a semantic graph-conditioned reconstruction module (SGCM). This module generates vision-language (VL) tokens through graph-based conditions and enables cross-dimensional interaction between visual and textual features via multihead attention. The reconstructed VL features are then deeply fused using the language vision transformer (LViT), achieving fine-grained feature alignment and high-level semantic interaction. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods. Our code is available at https://github.com/cn-xvkong/MGCR.
English Summary: MGCR-Net leverages multimodal large language models for remote sensing change detection, integrating visual and textual features through graph-conditioned reconstruction and cross-modal interaction to outperform mainstream methods on four public datasets.

Authors:Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
Title: A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Abstract:
Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods typically adopt fixed compression ratios and cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while, on average, fully retaining baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
English Summary: GlimpsePrune is a dynamic visual token pruning framework that removes 92.6% of tokens in a single forward pass while retaining baseline performance, enabling more efficient large vision-language models.
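A minimal sketch of the dynamic pruning idea: score visual tokens against a query-conditioned glimpse vector and keep whatever clears a relevance threshold, so the retained count adapts to scene complexity rather than following a fixed ratio. The scoring rule and threshold are assumptions, not GlimpsePrune's exact criterion.

```python
# Sketch: threshold-based dynamic token pruning driven by a "glimpse" vector.
# The dot-product scoring and threshold are illustrative assumptions.
import torch

def glimpse_prune(visual_tokens, glimpse, threshold=0.001):
    # visual_tokens: (batch, n, dim); glimpse: (batch, dim) prompt-aware summary
    scores = torch.einsum("bnd,bd->bn", visual_tokens, glimpse)
    scores = scores.softmax(dim=-1)                 # relevance distribution
    keep = scores > threshold                       # adaptive, not a fixed ratio
    # Return a ragged list: each sample keeps a different number of tokens.
    return [visual_tokens[b][keep[b]] for b in range(visual_tokens.size(0))]

v = torch.randn(2, 576, 1024)
g = torch.randn(2, 1024)
kept = glimpse_prune(v, g)
print([k.shape[0] for k in kept])  # per-sample token counts after pruning
```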

Authors:Rushin H. Gindra, Giovanni Palla, Mathias Nguyen, Sophia J. Wagner, Manuel Tran, Fabian J Theis, Dieter Saur, Lorin Crawford, Tingying Peng
Title: A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics
Abstract:
Spatial transcriptomics enables simultaneous measurement of gene expression and tissue morphology, offering unprecedented insights into cellular organization and disease mechanisms. However, the field lacks comprehensive benchmarks for evaluating multimodal learning methods that leverage both histology images and gene expression data. Here, we present HESCAPE, a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics, built on a curated pan-organ dataset spanning 6 different gene panels and 54 donors. We systematically evaluated state-of-the-art image and gene expression encoders across multiple pretraining strategies and assessed their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our benchmark demonstrates that gene expression encoders are the primary determinant of strong representational alignment, and that gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches. However, downstream task evaluation reveals a striking contradiction: while contrastive pretraining consistently improves gene mutation classification performance, it degrades direct gene expression prediction compared to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor that interferes with effective cross-modal alignment. Our findings highlight the critical need for batch-robust multimodal learning approaches in spatial transcriptomics. To accelerate progress in this direction, we release HESCAPE, providing standardized datasets, evaluation protocols, and benchmarking tools for the community.
English: HESCAPE is a large-scale benchmark for cross-modal contrastive pretraining in spatial transcriptomics; it shows that such pretraining improves gene mutation classification but degrades gene expression prediction due to batch effects, underscoring the need for batch-robust multimodal learning methods.

Authors:Peirong Zhang, Kai Ding, Lianwen Jin
Title: Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification
Abstract:
In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM's superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.
English: SPECTRUM is a temporal-frequency synergistic model that strengthens online handwriting verification by integrating multi-domain representations through multi-scale interaction and self-gated fusion, outperforming temporal-only methods and demonstrating the value of multi-domain learning.

Authors:Onat Vuran, Hsuan-I Ho
Title: ReMu: Reconstructing Multi-layer 3D Clothed Human from Image Layers
Abstract:
The reconstruction of multi-layer 3D garments typically requires expensive multi-view capture setups and specialized 3D editing efforts. To support the creation of life-like clothed human avatars, we introduce ReMu for reconstructing multi-layer clothed humans in a new setup, Image Layers, which captures a subject wearing different layers of clothing with a single RGB camera. To reconstruct physically plausible multi-layer 3D garments, a unified 3D representation is necessary to model these garments in a layered manner. Thus, we first reconstruct and align each garment layer in a shared coordinate system defined by the canonical body pose. Afterwards, we introduce a collision-aware optimization process to address interpenetration and further refine the garment boundaries leveraging implicit neural fields. It is worth noting that our method is template-free and category-agnostic, which enables the reconstruction of 3D garments in diverse clothing styles. Through our experiments, we show that our method reconstructs nearly penetration-free 3D clothed humans and achieves competitive performance compared to category-specific methods. Project page: https://eth-ait.github.io/ReMu/

Authors:Haoquan Lu, Hanzhe Liang, Jie Zhang, Chenxi Hu, Jinbao Wang, Can Gao
Title: C3D-AD: Toward Continual 3D Anomaly Detection via Kernel Attention with Learnable Advisor
Abstract:
3D Anomaly Detection (AD) has shown great potential in detecting anomalies or defects of high-precision industrial products. However, existing methods are typically trained in a class-specific manner and also lack the capability of learning from emerging classes. In this study, we propose a continual learning framework named Continual 3D Anomaly Detection (C3D-AD), which can not only learn generalized representations for multi-class point clouds but also handle new classes emerging over time. Specifically, in the feature extraction module, to extract generalized local features from diverse product types of different tasks efficiently, a Kernel Attention with random feature Layer (KAL) is introduced, which normalizes the feature space. Then, to reconstruct data correctly and continually, an efficient Kernel Attention with learnable Advisor (KAA) mechanism is proposed, which learns the information from new categories while discarding redundant old information within both the encoder and decoder. Finally, to maintain representation consistency across tasks, a Reconstruction with Parameter Perturbation (RPP) module is proposed by designing a representation rehearsal loss function, which ensures that the model remembers previous category information and returns category-adaptive representations. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed method, achieving an average performance of 66.4%, 83.1%, and 63.4% AUROC on Real3D-AD, Anomaly-ShapeNet, and MulSen-AD, respectively.
English: This study proposes C3D-AD, a continual learning framework for 3D anomaly detection that handles multi-class and emerging categories through kernel-attention feature extraction, advisor-guided reconstruction, and representation-consistency modules, achieving strong performance on three public datasets.

Authors:Alec Sargood, Lemuel Puglisi, James H. Cole, Neil P. Oxtoby, Daniele Ravì, Daniel C. Alexander
Title: CoCoLIT: ControlNet-Conditioned Latent Image Translation for MRI to Amyloid PET Synthesis
Abstract:
Synthesizing amyloid PET scans from the more widely available and accessible structural MRI modality offers a promising, cost-effective approach for large-scale Alzheimer's Disease (AD) screening. This is motivated by evidence that, while MRI does not directly detect amyloid pathology, it may nonetheless encode information correlated with amyloid deposition that can be uncovered through advanced modeling. However, the high dimensionality and structural complexity of 3D neuroimaging data pose significant challenges for existing MRI-to-PET translation methods. Modeling the cross-modality relationship in a lower-dimensional latent space can simplify the learning task and enable more effective translation. As such, we present CoCoLIT (ControlNet-Conditioned Latent Image Translation), a diffusion-based latent generative framework that incorporates three main innovations: (1) a novel Weighted Image Space Loss (WISL) that improves latent representation learning and synthesis quality; (2) a theoretical and empirical analysis of Latent Average Stabilization (LAS), an existing technique used in similar generative models to enhance inference consistency; and (3) the introduction of ControlNet-based conditioning for MRI-to-PET translation. We evaluate CoCoLIT's performance on publicly available datasets and find that our model significantly outperforms state-of-the-art methods on both image-based and amyloid-related metrics. Notably, in amyloid-positivity classification, CoCoLIT outperforms the second-best method with improvements of +10.5% on the internal dataset and +23.7% on the external dataset. The code and models of our approach are available at https://github.com/brAIn-science/CoCoLIT.
English: CoCoLIT is a diffusion-based latent generative framework that synthesizes amyloid PET scans from structural MRI, significantly outperforming existing methods on image quality and amyloid-positivity classification.
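Latent Average Stabilization (LAS), which the paper analyzes, can be sketched in a few lines: sample the latent several times under the same MRI condition and decode the mean, trading a little diversity for more consistent synthesis. The sampler and decoder callables below are placeholders, not the CoCoLIT API.

```python
# Sketch of LAS: average several stochastic latent samples before decoding.
# sample_latent and decode are placeholder callables, not the CoCoLIT API.
import torch

@torch.no_grad()
def synthesize_with_las(sample_latent, decode, mri_cond, n_runs=4):
    # sample_latent: callable(cond) -> latent tensor; decode: latent -> image
    latents = torch.stack([sample_latent(mri_cond) for _ in range(n_runs)])
    z_mean = latents.mean(dim=0)  # stabilized latent across stochastic runs
    return decode(z_mean)

# Usage with dummy stand-ins for the diffusion sampler and VAE decoder:
fake_sampler = lambda cond: cond + 0.1 * torch.randn_like(cond)
fake_decoder = lambda z: z.clamp(-1, 1)
pet = synthesize_with_las(fake_sampler, fake_decoder, torch.randn(1, 4, 32, 32, 32))
```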

Authors:Zeyu Pan, Ping Li, Wenxiao Wang
Title: SGCap: Decoding Semantic Group for Zero-shot Video Captioning
Abstract:
Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by a single frame-level caption fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components: the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. The two modules construct semantically diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at https://github.com/mlvccn/SGCap_Video.
English: The Semantic Group Captioning (SGCap) method advances zero-shot video captioning with a decoding strategy that captures temporal dynamics and modules that enhance semantic diversity and inter-sentence causal relationships, outperforming prior zero-shot methods and rivaling fully supervised approaches.

Authors:Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, Linlin Shen
Title: DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing
Abstract:
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to implicitly guide component separation using language supervision. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
English: This paper introduces Weakly Supervised Face Parsing (WSFP), which cuts annotation cost by segmenting facial components from weak supervision alone, and proposes DisFaceRep, a framework that disentangles co-occurring facial components through explicit and implicit mechanisms, significantly outperforming existing weakly supervised segmentation methods on multiple datasets.

Authors:Xinyu Chen, Haotian Zhai, Can Zhang, Xiupeng Shi, Ruirui Li
Title: Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models
Abstract:
In the zero-shot setting, test-time adaptation (TTA) adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further develop MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance. Project Page available at: https://zhaihaotian.github.io/MCP-ICCV25/
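A minimal sketch of the entropy cache component, assuming CLIP-style features and zero-shot logits: keep the lowest-entropy test features per predicted class as prototypes and add a cosine-similarity bonus to the logits. Cache size and the mixing weight beta are assumptions, and the align and negative caches are omitted.

```python
# Sketch: entropy cache for test-time adaptation. Per predicted class, keep the
# lowest-entropy features as prototypes; blend prototype similarity into logits.
# Cache size and beta are illustrative assumptions; align/negative caches omitted.
import torch
import torch.nn.functional as F

class EntropyCache:
    def __init__(self, n_classes, per_class=3):
        self.store = {c: [] for c in range(n_classes)}  # (entropy, feature) pairs
        self.per_class = per_class

    def update(self, feat, probs):
        ent = -(probs * probs.clamp_min(1e-8).log()).sum().item()
        c = int(probs.argmax())
        self.store[c].append((ent, feat))
        # Keep only the most confident (lowest-entropy) entries per class.
        self.store[c] = sorted(self.store[c], key=lambda t: t[0])[: self.per_class]

    def prototypes(self):
        return {c: torch.stack([f for _, f in v]).mean(0)
                for c, v in self.store.items() if v}

def calibrated_logits(zero_shot_logits, feat, cache, beta=2.0):
    logits = zero_shot_logits.clone()
    for c, proto in cache.prototypes().items():
        logits[c] += beta * F.cosine_similarity(feat, proto, dim=0)
    return logits

# Usage: per test sample, cache.update(feat, probs), then
# logits = calibrated_logits(clip_logits, feat, cache).
```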

Authors:Yuanlin Yang, Quanjian Song, Zhexian Gao, Ge Wang, Shanshan Li, Xiaoyan Zhang
Title: StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling
Abstract:
Diffusion models have emerged as the dominant paradigm for style transfer, but their text-driven mechanism is hindered by a core limitation: it treats textual descriptions as uniform, monolithic guidance. This limitation overlooks the semantic gap between the non-spatial nature of textual descriptions and the spatially-aware attributes of visual style, often leading to the loss of semantic structure and fine-grained details during stylization. In this paper, we propose StyDeco, an unsupervised framework that resolves this limitation by learning text representations specifically tailored for the style transfer task. Our framework first employs Prior-Guided Data Distillation (PGD), a strategy designed to distill stylistic knowledge without human supervision. It leverages a powerful frozen generative model to automatically synthesize pseudo-paired data. Subsequently, we introduce Contrastive Semantic Decoupling (CSD), a task-specific objective that adapts a text encoder using domain-specific weights. CSD performs two-class clustering in the semantic space, encouraging source and target representations to form distinct clusters. Extensive experiments on three classic benchmarks demonstrate that our framework outperforms several existing approaches in both stylistic fidelity and structural preservation, highlighting its effectiveness in style transfer with semantic preservation. In addition, our framework supports a unique de-stylization process, further demonstrating its extensibility. Our code is available at https://github.com/QuanjianSong/StyDeco.
中文: 扩散模型在风格转换中因文本与视觉风格间的语义鸿沟常导致语义结构丢失,而我们提出的无监督框架StyDeco通过定制化文本表征学习解决了这一问题,在风格保真度和结构保持上优于现有方法。
English: Diffusion models for style transfer often fail to preserve semantic structure due to the semantic gap between text and visual style, but our proposed unsupervised framework, StyDeco, overcomes this by learning tailored text representations and outperforms existing methods in fidelity and preservation.
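A sketch of what a two-class clustering objective in the spirit of CSD could look like, assuming batches of source and target text embeddings and an InfoNCE-style temperature (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def two_cluster_contrastive(src: torch.Tensor, tgt: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """Pull same-domain text embeddings together and push source and target
    apart, so the two domains form distinct clusters in the semantic space."""
    z = F.normalize(torch.cat([src, tgt], dim=0), dim=-1)
    labels = torch.cat([torch.zeros(src.size(0)),
                        torch.ones(tgt.size(0))]).to(z.device)
    sim = (z @ z.t()) / tau
    sim.fill_diagonal_(float("-inf"))               # exclude self-similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    log_prob = sim.log_softmax(dim=1)
    # log of the probability mass each sample assigns to its own cluster
    own_cluster = log_prob.masked_fill(~same, float("-inf")).logsumexp(dim=1)
    return -own_cluster.mean()
```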

Authors:Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu
Title: A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding
Abstract:
Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.
中文摘要:本文基于nuScenes数据集提出了三维占据栅格接地基准,并设计了端到端模型GroundingOcc,通过融合视觉、文本和点云特征实现从粗到精的室外场景物体定位与体素级占据预测,在基准测试中显著优于现有方法。
English Summary: This paper introduces a 3D occupancy grounding benchmark using the nuScenes dataset and proposes GroundingOcc, an end-to-end model that integrates visual, textual, and point cloud features to achieve precise object localization and voxel-level occupancy prediction in outdoor scenes, outperforming existing methods.

Authors:Ranran Huang, Krystian Mikolajczyk
Title: No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Abstract:
We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
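As a sketch of the pixel-alignment reprojection term under standard pinhole assumptions (shapes and names are illustrative, not the paper's API):

```python
import torch

def reprojection_loss(means3d: torch.Tensor, K: torch.Tensor,
                      T_cam: torch.Tensor, pixels: torch.Tensor) -> torch.Tensor:
    """Penalize the distance between Gaussian centers reprojected with the
    estimated pose and the pixels they were predicted from, encouraging
    pixel-aligned primitives. means3d: (N, 3) canonical-space centers;
    T_cam: (4, 4) world-to-camera; K: (3, 3) intrinsics; pixels: (N, 2)."""
    ones = torch.ones_like(means3d[:, :1])
    cam = (T_cam @ torch.cat([means3d, ones], dim=1).t()).t()[:, :3]
    uv = (K @ cam.t()).t()
    uv = uv[:, :2] / uv[:, 2:3].clamp_min(1e-6)   # perspective division
    return (uv - pixels).norm(dim=-1).mean()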

Authors:Xinyu Yan, Meijun Sun, Ge-Peng Ji, Fahad Shahbaz Khan, Salman Khan, Deng-Ping Fan
Title: LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
Abstract:
We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_β^ω$ gains of 4.6\% with both the LS and WR strategies and 3.6\% gains with only the LS strategy on DIS-TE. Codes will be made available at https://github.com/XinyuYanTJU/LawDIS.
中文:LawDIS是一种基于语言窗口控制的二分图像分割框架,通过语言提示生成初始遮罩并结合窗口细化策略,在DIS5K基准测试中各项指标显著优于现有方法。
English: LawDIS is a novel framework for dichotomous image segmentation that integrates language prompts and adjustable window controls to generate precise object masks, demonstrating superior performance over existing methods on the DIS5K benchmark.

Authors:Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng, Kaidong Yu
Title: Personalized Safety Alignment for Text-to-Image Diffusion Models
Abstract:
Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at https://m-e-agi-lab.github.io/PSAlign/.
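A minimal sketch of profile conditioning via cross-attention, assuming the user profile is encoded as a few tokens (the dimensions and the residual placement are assumptions):

```python
import torch
import torch.nn as nn

class ProfileCrossAttention(nn.Module):
    """Inject user-profile tokens into U-Net latents via cross-attention."""
    def __init__(self, dim: int = 320, profile_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=profile_dim,
                                          vdim=profile_dim, batch_first=True)

    def forward(self, x: torch.Tensor, profile: torch.Tensor) -> torch.Tensor:
        # x: (B, HW, dim) latent tokens; profile: (B, P, profile_dim)
        out, _ = self.attn(self.norm(x), profile, profile)
        return x + out  # residual: the profile perturbs, not replaces, content
```

For example, `ProfileCrossAttention()(torch.randn(2, 64, 320), torch.randn(2, 4, 768))` conditions 64 latent tokens on 4 profile tokens.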

Authors:Dianyi Yang, Xihan Wang, Yu Gao, Shiyang Liu, Bohan Ren, Yufeng Yue, Yi Yang
Title: OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding
Abstract:
Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17\% improvement in 3D mIoU compared to the fixed-threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, as well as showcasing its effectiveness in language-guided scene interaction. The code is available at https://young-bit.github.io/opengs-fusion.github.io/.
中文: OpenGS-Fusion提出了一种开放词汇的密集建图框架,通过结合3D高斯表示和自适应阈值优化,提升了语义建模和物体级理解能力,在3D mIoU上实现了17%的提升,并在场景重建与交互中表现出卓越性能。
English: OpenGS-Fusion introduces an open-vocabulary dense mapping framework that enhances semantic modeling and object-level understanding by integrating 3D Gaussian representation with adaptive thresholding, achieving a 17% improvement in 3D mIoU and superior performance in scene reconstruction and interaction.

Authors:Huyu Wu, Duo Su, Junjie Hou, Guang Li
Title: Dataset Condensation with Color Compensation
Abstract:
Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. Through empirical observation, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes the latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3, which outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, beyond its use for downstream tasks, DC3 is the first work to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data are available at https://github.com/528why/Dataset-Condensation-with-Color-Compensation.
中文摘要:提出的DC3框架通过校准选择策略和潜在扩散模型增强图像色彩多样性,解决了数据集压缩中的性能瓶颈,在多个基准测试中实现卓越性能且无语义失真。
English Summary: The proposed DC3 framework addresses dataset condensation bottlenecks by enhancing color diversity through a calibrated selection strategy and latent diffusion model, achieving superior performance and generalization across benchmarks without semantic distortion.
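One standard way to quantify the colorfulness the abstract argues for is the Hasler-Suesstrunk metric; a sketch follows (DC3's actual selection strategy is more involved):

```python
import numpy as np

def colorfulness(img: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness of an RGB image (HxWx3, values 0-255):
    higher scores mean richer color diversity."""
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g                       # red-green opponent channel
    yb = 0.5 * (r + g) - b           # yellow-blue opponent channel
    std = np.hypot(rg.std(), yb.std())
    mean = np.hypot(rg.mean(), yb.mean())
    return float(std + 0.3 * mean)
```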

Authors:Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, Ehsan Adeli
Title: UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
Abstract:
Egocentric human motion generation and forecasting with scene context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene representations. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

Authors:Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal
Title: The Promise of RL for Autoregressive Image Editing
Abstract:
We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
中文摘要:本研究提出EARL自回归图像编辑模型,通过强化学习结合多模态验证器实现卓越性能,在训练数据大幅减少的情况下仍优于现有基线方法。
English Summary: The study introduces EARL, an autoregressive image editing model that demonstrates superior performance through reinforcement learning combined with a multimodal verifier, outperforming baselines with significantly less training data.

Authors:Fenghe Tang, Bingkun Nian, Jianrui Ding, Wenxin Ma, Quan Quan, Chengqi Dong, Jie Yang, Wei Liu, S. Kevin Zhou
Title: Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
Abstract:
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
中文: 提出的Mobile U-ViT模型通过结合类Transformer表征学习和优化的局部-全局信息交互,解决了移动设备上高效医学图像分割的难题,在保持低计算需求的同时,在多个数据集上实现了最先进的性能。
English: The proposed Mobile U-ViT model addresses the challenge of efficient medical image segmentation on mobile devices by combining transformer-like representation learning with optimized local-global information exchange, achieving state-of-the-art performance across multiple datasets while maintaining low computational demands.

Authors:Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou
Title: ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
Abstract:
We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model, GLIGEN, trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at https://github.com/CihangPeng/ROVI.
中文: ROVI是一个高质量实例化文本到图像生成的合成数据集,通过重新标注策略将全局提示与实例标注相关联,在图像质量、分辨率和类别多样性上显著超越现有数据集。
English: ROVI is a high-quality synthetic dataset for instance-grounded text-to-image generation, created through a re-captioning strategy that links global prompts to instance annotations and significantly outperforms existing datasets in quality and category diversity.
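The re-captioning pipeline reduces to three staged calls; a sketch with hypothetical `vlm`, `llm`, and `detector` callables standing in for any concrete VLM, LLM, and open-vocabulary detector:

```python
def recaption(image, vlm, llm, detector):
    """ROVI-style pre-detection re-captioning (interfaces are hypothetical)."""
    # 1) VLM writes a comprehensive visual description, including secondary
    #    elements a human annotator would typically overlook.
    description = vlm(image, prompt="Describe everything visible in detail.")
    # 2) LLM flattens the description into candidate categories for detection.
    categories = llm("List every object category mentioned, one per line:\n"
                     + description).splitlines()
    # 3) Open-vocabulary detector grounds those categories as instance boxes.
    instances = detector(image, vocabulary=categories)
    return description, instances   # global prompt + linked annotations
```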

Authors:Mohammad Mohammadi, Ziyi Wu, Igor Gilitschenski
Title: TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
Abstract:
Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
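A simplified stand-in for the reconstruction target: leaky integration of event polarities into pseudo-grayscale frames (the decay factor and time binning are assumptions):

```python
import numpy as np

def accumulate_events(events: np.ndarray, shape: tuple,
                      n_frames: int, decay: float = 0.9) -> np.ndarray:
    """Leaky integration of event polarities into pseudo-grayscale frames.
    events: (N, 4) rows of (t, x, y, polarity) with t normalized to [0, 1)
    and polarity in {-1, +1}; shape: (H, W)."""
    frames = np.zeros((n_frames, *shape), dtype=np.float32)
    state = np.zeros(shape, dtype=np.float32)
    bins = np.clip((events[:, 0] * n_frames).astype(int), 0, n_frames - 1)
    for i in range(n_frames):
        sel = events[bins == i]
        state *= decay                          # slowly forget old brightness
        np.add.at(state, (sel[:, 2].astype(int), sel[:, 1].astype(int)),
                  sel[:, 3].astype(np.float32))
        frames[i] = state                       # snapshot as the i-th frame
    return frames
```

Reconstructing such frames forces the model to reason over a long history of events, since each pixel's value depends on accumulated past activity rather than a single short interval.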

Authors:Wenxuan Guo, Xiuwei Xu, Hang Yin, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Title: IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
Abstract:
Visual navigation with an image as goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL, or on modular policies with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera pose, directly leveraging 3DGS for image localization during the agent exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which is equivalent to an efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.
中文摘要:IGL-Nav提出了一种增量式3D高斯定位框架,通过单目场景更新、几何匹配和可微分渲染优化,实现了高效的3D感知图像目标导航,性能显著超越现有方法。
English Summary: IGL-Nav introduces an incremental 3D Gaussian localization framework that efficiently achieves 3D-aware image-goal navigation through monocular scene updates, geometric matching, and differentiable rendering optimization, significantly outperforming existing methods.

Authors:Irene Iele, Francesco Di Feola, Valerio Guarrasi, Paolo Soda
Title: Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation
Abstract:
Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it struggles to handle out-of-distribution samples without performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of state-of-the-art methods that uniformly apply adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: https://github.com/Sample-Aware-TTA/Code.
中文: 本文提出了一种新颖的测试时自适应框架,通过重建模块和动态适应块动态调整图像翻译过程以处理分布外样本,在医学影像任务中展现出优于现有方法的性能且不影响分布内样本。
English: This paper introduces a novel Test-Time Adaptation framework that dynamically adjusts image translation for out-of-distribution samples using a Reconstruction Module and Dynamic Adaptation Block, showing improved performance in medical imaging tasks without compromising in-distribution samples.
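The gating logic can be sketched as follows, with `recon_module` and `adapt_block` as hypothetical stand-ins for the paper's Reconstruction Module and Dynamic Adaptation Block, and an assumed threshold:

```python
import torch

@torch.no_grad()
def domain_shift_score(x: torch.Tensor, recon_module) -> float:
    """Quantify domain shift as the test sample's reconstruction error."""
    return torch.mean((recon_module(x) - x) ** 2).item()

def sample_aware_translate(x, model, recon_module, adapt_block,
                           threshold: float = 0.01):
    """Adapt only samples that look out-of-distribution; in-distribution
    inputs pass through the pretrained model unchanged."""
    if domain_shift_score(x, recon_module) > threshold:
        return adapt_block(model, x)   # sample-specific feature modulation
    return model(x)
```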

Authors:Regine Hartwig, Dominik Muhle, Riccardo Marin, Daniel Cremers
Title: GECO: Geometrically Consistent Embedding with Lightspeed Inference
Abstract:
Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs). We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning. Link to project page: https://reginehartwig.github.io/publications/geco/

Authors:Chende Zheng, Ruiqi suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, Chao Shen
Title: D3: Training-Free AI-Generated Video Detection Using Second-Order Features
Abstract:
The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, and derive Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (Gen-Video, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3's exceptional computational efficiency and strong robustness. Our code is available at https://github.com/Zig-HS/D3.
中文: 本研究提出D3方法,通过利用二阶时序差异无需训练即可有效区分AI生成视频与真实视频,在多个数据集上展现出卓越性能与计算效率。
English: The study introduces D3, a training-free detection method that leverages second-order temporal discrepancies to effectively distinguish AI-generated videos from real ones, demonstrating superior performance and computational efficiency across multiple datasets.
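The core feature is just a temporal difference of differences; a minimal sketch (the scalar statistic used for detection is an assumption):

```python
import numpy as np

def second_order_features(frames: np.ndarray) -> np.ndarray:
    """Second-order central difference over time ('difference of differences'):
    d2[t] = f[t+1] - 2*f[t] + f[t-1], per pixel. frames: (T, H, W), T >= 3."""
    return frames[2:] - 2.0 * frames[1:-1] + frames[:-2]

def d3_score(frames: np.ndarray) -> float:
    """One scalar statistic of the second-order feature distribution; real and
    generated videos can then be separated by thresholding such statistics."""
    return float(np.abs(second_order_features(frames)).mean())
```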

Authors:Junhao Zheng, Jiahao Sun, Chenhao Lin, Zhengyu Zhao, Chen Ma, Chong Zhang, Cong Wang, Qian Wang, Chao Shen
Title: Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights
Abstract:
Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This yields a large-scale adversarial patch dataset with 94 types of patches and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.
Chinese: 本研究首次建立了针对目标检测器对抗性补丁攻击防御的综合评估基准,揭示了数据分布而非高频特征对防御的重要性,以及自适应攻击的有效性等关键发现,同时提供了数据集和框架,可将现有防御性能提升15.09%。
English: This study introduces the first comprehensive benchmark for evaluating defenses against adversarial patch attacks on object detectors, revealing key insights such as the importance of data distribution over high frequencies and the effectiveness of adaptive attacks, while providing a dataset and framework to improve defense performance by 15.09%.

Authors:Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Xian Liu, Zhongang Cai, Lei Yang, Yulun Zhang, Haoqian Wang, Ziwei Liu
Title: DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
Abstract:
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X's robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
Chinese: DPoser-X是一种基于扩散模型的3D全身人体姿态先验模型,通过将多种姿态任务统一为逆问题并采用创新的训练机制,在各项基准测试中均优于现有最优方法。
English: DPoser-X is a diffusion-based prior model for 3D whole-body human pose modeling that unifies various pose tasks as inverse problems and outperforms state-of-the-art methods through innovative training techniques.
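Truncated timestep scheduling amounts to restricting where diffusion timesteps are drawn; a sketch with illustrative bounds (the actual range is tuned to pose data in the paper):

```python
import torch

def sample_truncated_timesteps(batch: int, t_max: int = 1000,
                               lo: float = 0.0, hi: float = 0.6) -> torch.Tensor:
    """Draw diffusion timesteps only from a truncated range [lo, hi) * t_max,
    biasing the prior toward lower-noise steps where pose detail survives."""
    return torch.randint(int(lo * t_max), int(hi * t_max), (batch,))
```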

Authors:Jiajun Le, Jiayi Ma
Title: GeoMoE: Divide-and-Conquer Motion Field Modeling with Mixture-of-Experts for Two-View Geometry
Abstract:
Recent progress in two-view geometry increasingly emphasizes enforcing smoothness and global consistency priors when estimating motion fields between pairs of images. However, in complex real-world scenes, characterized by extreme viewpoint and scale changes as well as pronounced depth discontinuities, the motion field often exhibits diverse and heterogeneous motion patterns. Most existing methods lack targeted modeling strategies and fail to explicitly account for this variability, resulting in estimated motion fields that diverge from their true underlying structure and distribution. We observe that Mixture-of-Experts (MoE) can assign dedicated experts to motion sub-fields, enabling a divide-and-conquer strategy for heterogeneous motion patterns. Building on this insight, we re-architect motion field modeling in two-view geometry with GeoMoE, a streamlined framework. Specifically, we first devise a Probabilistic Prior-Guided Decomposition strategy that exploits inlier probability signals to perform a structure-aware decomposition of the motion field into heterogeneous sub-fields, sharply curbing outlier-induced bias. Next, we introduce an MoE-Enhanced Bi-Path Rectifier that enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert for targeted modeling, thereby decoupling heterogeneous motion regimes, suppressing cross-sub-field interference and representational entanglement, and yielding fine-grained motion-field rectification. With this minimalist design, GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization. The source code and pre-trained models are available at https://github.com/JiajunLe/GeoMoE.
中文摘要:近期两视图几何学进展强调在估计图像对间运动场时需加强平滑性和全局一致性,但现有方法在复杂场景中因异质运动模式而表现不佳,因此提出GeoMoE框架,利用专家混合模型分解并针对性建模这些模式,在相对位姿和单应性估计上超越现有最优方法。
English Summary: Recent advances in two-view geometry highlight the need for smoothness and global consistency in motion field estimation, but existing methods struggle with complex scenes due to heterogeneous motion patterns, leading to the introduction of GeoMoE, a framework that uses Mixture-of-Experts to decompose and model these patterns effectively, achieving superior performance in pose and homography estimation.

Authors:Marc Hölle, Walter Kellermann, Vasileios Belagiannis
Title: Uncertainty-Aware Likelihood Ratio Estimation for Pixel-Wise Out-of-Distribution Detection
Abstract:
Semantic segmentation models trained on known object classes often fail in real-world autonomous driving scenarios by confidently misclassifying unknown objects. While pixel-wise out-of-distribution detection can identify unknown objects, existing methods struggle in complex scenes where rare object classes are often confused with truly unknown objects. We introduce an uncertainty-aware likelihood ratio estimation method that addresses these limitations. Our approach uses an evidential classifier within a likelihood ratio test to distinguish between known and unknown pixel features from a semantic segmentation model, while explicitly accounting for uncertainty. Instead of producing point estimates, our method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers. We show that by incorporating uncertainty in this way, outlier exposure can be leveraged more effectively. Evaluated on five standard benchmark datasets, our method achieves the lowest average false positive rate (2.5%) among state-of-the-art while maintaining high average precision (90.91%) and incurring only negligible computational overhead. Code is available at https://github.com/glasbruch/ULRE.
中文: 本文提出了一种不确定性感知似然比估计方法,能在自动驾驶的语义分割中有效区分已知与未知物体,以极低计算开销实现了最低误报率和较高精确度。
English: This paper introduces an uncertainty-aware likelihood ratio estimation method that effectively distinguishes known from unknown objects in semantic segmentation for autonomous driving, achieving a low false positive rate and high precision with minimal computational overhead.

Authors:Jingchao Xie, Oussema Dhaouadi, Weirong Chen, Johannes Meier, Jacques Kaiser, Daniel Cremers
Title: CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry
Abstract:
Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Consequently, experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.
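Under an independent-Gaussian assumption the combination has a simple closed form; a sketch of an uncertainty-weighted photometric objective (the paper's exact formulation may differ):

```python
import torch

def combined_uncertainty_loss(photo_err: torch.Tensor,
                              sigma_tgt: torch.Tensor,
                              sigma_ref_warped: torch.Tensor) -> torch.Tensor:
    """Gaussian negative log-likelihood with uncertainties combined across
    frames: assuming independent noise, variances add,
        sigma^2 = sigma_tgt^2 + sigma_ref_warped^2,
    so pixels unreliable in either frame (e.g. dynamic objects) are
    downweighted. All tensors: (B, 1, H, W); sigma_ref_warped is the
    reference-frame uncertainty projected into the target view."""
    var = sigma_tgt ** 2 + sigma_ref_warped ** 2
    return (photo_err ** 2 / (2 * var) + 0.5 * var.log()).mean()
```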

Authors:Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen
Title: HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9$\times$, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
Chinese: HiPrune是一种无需训练、模型无关的令牌剪枝框架,利用视觉编码器中的分层注意力结构保留信息丰富的令牌,仅用11.1%的令牌即可保持高达99.5%的任务准确率,同时将推理计算量和延迟降低高达9倍。
English: HiPrune is a training-free, model-agnostic token pruning framework that leverages hierarchical attention in vision encoders to retain informative tokens, achieving up to 99.5% task accuracy with only 11.1% tokens while reducing inference FLOPs and latency by up to 9 times.
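A sketch of the three-way token selection, assuming per-token attention mass from a middle and a deep encoder layer has already been extracted (the layer choice and token counts are assumptions):

```python
import torch

def hiprune_select(attn_mid: torch.Tensor, attn_deep: torch.Tensor,
                   n_anchor: int = 32, n_register: int = 8,
                   buffer_radius: int = 1) -> torch.Tensor:
    """Select tokens to keep: anchors (high attention mass in an object-centric
    middle layer), buffers (sequence neighbours of anchors; on a row-major
    token grid this approximates spatial adjacency), and registers (high
    attention mass in a deep, globally-summarizing layer).
    attn_mid, attn_deep: (N,) per-token attention mass."""
    anchors = attn_mid.topk(n_anchor).indices
    buffers = torch.cat([anchors + d
                         for d in range(-buffer_radius, buffer_radius + 1)])
    registers = attn_deep.topk(n_register).indices
    keep = torch.cat([buffers, registers]).clamp(0, attn_mid.numel() - 1)
    return keep.unique(sorted=True)   # indices of visual tokens to retain
```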

Authors:Seunghyun Shin, Dongmin Shin, Jisu Shin, Hae-Gon Jeon, Joon-Young Lee
Title: Video Color Grading via Look-Up Table Generation
Abstract:
Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes like look, mood, and emotion should be similar to those of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate user preferences via text prompts for low-level feature enhancement such as contrast and brightness. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at https://github.com/seunghyuns98/VideoColorGrading.
Chinese: 本文提出了一种基于参考的视频色彩分级框架,通过扩散模型生成查找表来对齐参考场景与输入视频的色彩属性,实现高效且保持结构细节的色彩调整,并可结合文本提示进行低层次特征增强。
English: This paper introduces a reference-based video color grading framework that uses a diffusion model to generate a lookup table for aligning color attributes with reference scenes, enabling efficient and structurally preserved color adjustments with optional text-guided enhancements.
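Once a LUT is generated, applying it is cheap and structure-preserving, which is where the speed and detail-preservation claims come from; a nearest-neighbour sketch (production grading typically interpolates trilinearly):

```python
import numpy as np

def apply_lut(frame: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Apply a 3D LUT to an RGB frame with nearest-neighbour lookup.
    frame: (H, W, 3) floats in [0, 1]; lut: (S, S, S, 3) mapping RGB -> RGB."""
    s = lut.shape[0]
    idx = np.clip(np.rint(frame * (s - 1)).astype(int), 0, s - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]
```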

Authors:Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, Liwei Wang
Title: Fine-grained Spatiotemporal Grounding on Egocentric Videos
Abstract:
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask .
中文: 本研究提出了首个以自我为中心视频的像素级时空定位基准EgoMask,通过自动标注流程和大规模训练数据集解决了物体持续时间短、轨迹稀疏等挑战,显著提升了模型性能并保持了对外中心视频数据集的兼容性。
English: This study introduces EgoMask, the first pixel-level benchmark for spatiotemporal video grounding in egocentric videos, addressing challenges like shorter object durations and sparser trajectories through an automatic annotation pipeline and a large-scale training dataset, which significantly improves model performance while maintaining exocentric dataset compatibility.

Authors:Mohammed Kamran, Maria Bernathova, Raoul Varga, Christian F. Singer, Zsuzsanna Bago-Horvath, Thomas Helbich, Georg Langs, Philipp Seeböck
Title: LesiOnTime -- Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI
Abstract:
Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BI-RADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at https://github.com/cirmuw/LesiOnTime
中文: LesiOnTime提出了一种融合纵向MRI扫描和BI-RADS评分的3D分割方法,通过时序注意力机制和一致性正则化,在乳腺癌筛查中实现早期小病灶检测的Dice指标提升5%。
English: LesiOnTime introduces a 3D segmentation method that integrates longitudinal MRI scans and BI-RADS scores through temporal attention and consistency regularization, achieving a 5% Dice improvement for early small lesion detection in breast cancer screening.

Authors:Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
Title: LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
Abstract:
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and 2) Fill Ratio (FI-R) for assessing layout control, and 3) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
中文:LAMIC是一种无需训练的框架,通过创新的注意力机制将单参考扩散模型扩展至多参考图像合成,具备布局感知能力,并以全面的评估指标实现了最先进的性能。
English: LAMIC is a training-free framework that extends single-reference diffusion models to multi-reference image synthesis with layout awareness, achieving state-of-the-art performance through novel attention mechanisms and comprehensive evaluation metrics.
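Group Isolation Attention can be approximated with a boolean attention mask; a sketch where group id 0 marks shared (e.g. text) tokens visible to all references (this encoding is an assumption):

```python
import torch

def group_isolation_mask(group_ids: torch.Tensor) -> torch.Tensor:
    """Boolean mask, True where attention is allowed: tokens attend within
    their own reference group, and group id 0 marks shared tokens (e.g. text)
    that stay visible to everyone. group_ids: (N,) integer labels."""
    same = group_ids.unsqueeze(0) == group_ids.unsqueeze(1)
    shared = group_ids == 0
    return same | shared.unsqueeze(0) | shared.unsqueeze(1)

# PyTorch's torch.nn.functional.scaled_dot_product_attention accepts such a
# boolean mask directly (True = may attend), so the mask can be dropped into
# MMDiT-style attention without retraining.
```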

Authors:Longfei Huang, Yu Liang, Hao Zhang, Jinwei Chen, Wei Dong, Lunde Chen, Wanyu Liu, Bo Li, Peng-Tao Jiang
Title: SDMatte: Grafting Diffusion Models for Interactive Matting
Abstract:
Recent interactive matting methods have shown satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte's sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Our code and model are available at https://github.com/vivoCameraResearch/SDMatte.
中文: 最近的交互式抠图方法在主物体区域表现良好,但在精细边缘细节提取上存在不足,为此我们提出了SDMatte模型,通过视觉提示驱动、坐标嵌入和掩码自注意力机制显著提升细节处理能力,多数据集实验验证了其优越性能。
English: Recent interactive matting methods perform well on primary object regions but struggle with fine edge details, prompting the development of SDMatte, a diffusion-based model that enhances detail extraction through visual prompts, coordinate embeddings, and masked self-attention, validated by superior experimental results.

Authors:M. A. Pérez-Cutiño, J. Valverde, J. Capitán, J. M. Díaz-Báñez
Title: Reducing the gap between general purpose data and aerial images in concentrated solar power plants
Abstract:
In the context of Concentrated Solar Power (CSP) plants, aerial images captured by drones present a unique set of challenges. Unlike urban or natural landscapes commonly found in existing datasets, solar fields contain highly reflective surfaces, and domain-specific elements that are uncommon in traditional computer vision benchmarks. As a result, machine learning models trained on generic datasets struggle to generalize to this setting without extensive retraining and large volumes of annotated data. However, collecting and labeling such data is costly and time-consuming, making it impractical for rapid deployment in industrial applications. To address this issue, we propose a novel approach: the creation of AerialCSP, a virtual dataset that simulates aerial imagery of CSP plants. By generating synthetic data that closely mimic real-world conditions, our objective is to facilitate pretraining of models before deployment, significantly reducing the need for extensive manual labeling. Our main contributions are threefold: (1) we introduce AerialCSP, a high-quality synthetic dataset for aerial inspection of CSP plants, providing annotated data for object detection and image segmentation; (2) we benchmark multiple models on AerialCSP, establishing a baseline for CSP-related vision tasks; and (3) we demonstrate that pretraining on AerialCSP significantly improves real-world fault detection, particularly for rare and small defects, reducing the need for extensive manual labeling. AerialCSP is made publicly available at https://mpcutino.github.io/aerialcsp/.

Authors:Sumin Seo, In Kyu Lee, Hyun-Woo Kim, Jaesik Min, Chung-Hwan Jung
Title: Diffusion-Based User-Guided Data Augmentation for Coronary Stenosis Detection
Abstract:
Coronary stenosis is a major risk factor for ischemic heart events leading to increased mortality, and medical treatments for this condition require meticulous, labor-intensive analysis. Coronary angiography provides critical visual cues for assessing stenosis, supporting clinicians in making informed decisions for diagnosis and treatment. Recent advances in deep learning have shown great potential for automated localization and severity measurement of stenosis. In real-world scenarios, however, the success of these competent approaches is often hindered by challenges such as limited labeled data and class imbalance. In this study, we propose a novel data augmentation approach that uses an inpainting method based on a diffusion model to generate realistic lesions, allowing user-guided control of severity. Extensive evaluation on lesion detection and severity classification across various synthetic dataset sizes shows superior performance of our method on both a large-scale in-house dataset and a public coronary angiography dataset. Furthermore, our approach maintains high detection and classification performance even when trained with limited data, highlighting its clinical importance in improving the assessment of severity of stenosis and optimizing data utilization for more reliable decision support.
Chinese: 本研究提出了一种基于扩散模型的数据增强方法,通过修复生成用户可控严重程度的逼真冠状动脉病变,显著提升了狭窄检测与分类性能,尤其在数据有限时效果突出,并在内部和公共血管造影数据集上得到验证。
English: This study introduces a novel data augmentation method using a diffusion model for inpainting to generate realistic coronary lesions with user-controlled severity, which significantly enhances stenosis detection and classification performance, especially with limited data, as validated on both in-house and public angiography datasets.

Authors:Runmin Cong, Zongji Yu, Hao Fang, Haoyan Sun, Sam Kwong
Title: UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken
Abstract:
Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particular characteristics of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model, UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to adapt Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation on the instances themselves through an Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both the UIIS and USIS10K datasets, while maintaining a low number of parameters and low computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.
中文摘要:UIS-Mamba模型通过动态树扫描和隐藏状态弱化模块,成功将Mamba架构应用于水下实例分割任务,在保持计算效率的同时实现了最优性能。
English Summary: The UIS-Mamba model introduces Dynamic Tree Scan and Hidden State Weaken modules to adapt the Mamba architecture for underwater instance segmentation, achieving state-of-the-art performance while maintaining computational efficiency.

Authors:Sangwoo Youn, Minji Lee, Nokap Tony Park, Yeonggyoo Jeon, Taeyoung Na
Title: IN2OUT: Fine-Tuning Video Inpainting Model for Video Outpainting Using Hierarchical Discriminator
Abstract:
Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, we suggest using video inpainting models, which excel at object flow learning and reconstruction, for outpainting rather than solely generating the background as existing methods do. However, directly applying or fine-tuning inpainting models to outpainting has proven ineffective, often leading to blurry results. Our extensive experiments on discriminator designs reveal that a critical component missing in the outpainting fine-tuning process is a discriminator capable of effectively assessing the perceptual quality of the extended areas. To tackle this limitation, we differentiate the objectives of adversarial training into global and local goals and introduce a hierarchical discriminator that meets both objectives. Additionally, we develop a specialized outpainting loss function that leverages both local and global features of the discriminator. Fine-tuning on this adversarial loss function enhances the generator's ability to produce both visually appealing and globally coherent outpainted scenes. Our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Supplementary materials, including the demo video and the code, are available on SigPort.
Chinese: 本文提出了一种分层判别器和专门设计的损失函数,通过提升感知质量和全局一致性来改进视频外推效果,在定量和定性评估中均优于现有方法。
English: This paper introduces a hierarchical discriminator and a specialized loss function to enhance video outpainting by improving perceptual quality and global coherence, outperforming existing methods.

Authors:Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai
Title: DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
Abstract:
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.
中文摘要:DC-AE 1.5通过结构化潜在空间和增强扩散训练技术,解决了高分辨率扩散模型收敛慢的问题,在提升图像生成质量的同时实现了更快的处理速度。
English Summary: DC-AE 1.5 introduces structured latent space and augmented diffusion training to overcome slow convergence issues in high-resolution diffusion models, achieving both superior image generation quality and faster processing speeds.
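The abstract describes a structured latent where front channels carry object structure and an extra diffusion objective is applied to those channels. Below is a minimal sketch of such an augmented loss under a simple noise-prediction setup; the 64/128 channel split, the interpolation path, and the 0.5 weight are assumptions, and `denoiser` is a hypothetical callable.

```python
# Minimal sketch of "augmented diffusion training": the standard denoising loss
# on all latent channels plus an extra objective restricted to the front
# "object" channels. Channel split, noising path, and weight are assumptions.
import torch
import torch.nn.functional as F

def augmented_diffusion_loss(denoiser, latents, num_object_channels=64, aug_weight=0.5):
    """latents: (B, C, H, W) structured latent, with the first
    `num_object_channels` channels encoding coarse object structure."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)   # timesteps in [0, 1]
    alpha = t.view(-1, 1, 1, 1)
    noisy = (1 - alpha) * latents + alpha * noise              # toy linear noising path
    pred = denoiser(noisy, t)                                   # predicts the added noise
    loss_full = F.mse_loss(pred, noise)                         # objective on all channels
    loss_obj = F.mse_loss(pred[:, :num_object_channels],        # extra objective on object channels
                          noise[:, :num_object_channels])
    return loss_full + aug_weight * loss_obj

if __name__ == "__main__":
    dummy = lambda x, t: torch.zeros_like(x)                    # stand-in denoiser for a smoke test
    z = torch.randn(4, 128, 16, 16)
    print(augmented_diffusion_loss(dummy, z).item())
```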

Authors:Janika Deborah Gajo, Gerarld Paul Merales, Jerome Escarcha, Brenden Ashley Molina, Gian Nartea, Emmanuel G. Maminta, Juan Carlos Roldan, Rowel O. Atienza
Title: Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents
Abstract:
We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via https://github.com/upeee/sari-sandbox-env.
中文: Sari Sandbox 是一个高保真、照片级真实的 3D 零售商店模拟环境,用于在购物任务中对具身智能体进行训练和性能基准测试,支持虚拟现实交互和视觉语言模型驱动的智能体操作。
English: Sari Sandbox is a photorealistic 3D retail simulation environment designed to train and benchmark embodied agents against human performance in shopping tasks, featuring interactive grocery items and supporting both VR interactions and VLM-powered agents.

Authors:Raiyaan Abdullah, Yogesh Singh Rawat, Shruti Vyas
Title: iSafetyBench: A video-language benchmark for safety in industrial environment
Abstract:
Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/iSafetyBench/data.
中文摘要:尽管视觉语言模型在标准视频任务中表现出强大的泛化能力,但在工业安全场景下表现不佳,为此我们开发了iSafetyBench——一个包含1100个真实工业视频的专用基准测试,揭示了模型在危险活动识别方面的显著不足,强调了开发更鲁棒的安全感知模型的必要性。
English Summary: Recent vision-language models show strong generalization in standard video tasks but perform poorly in industrial safety scenarios, prompting the creation of iSafetyBench—a specialized benchmark with 1,100 real-world industrial videos—which reveals significant gaps in recognizing hazardous activities and highlights the need for more robust safety-aware models.

Authors:Fei Zhang, Tianfei Zhou, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang
Title: Decouple before Align: Visual Disentanglement Enhances Prompt Tuning
Abstract:
Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.
中文: 提示调优作为一种高效的微调方法,虽能提升视觉语言模型性能,但存在视觉与文本信息不对称问题;本文提出的DAPT框架通过解耦视觉前景与背景并分别与文本对齐,有效增强模态对称性和注意力聚焦。
English: Prompt tuning is an efficient fine-tuning method that enhances vision-language models but suffers from visual-textual information asymmetry, which the proposed DAPT framework addresses by decoupling and aligning visual foreground and background with text to improve modal symmetry and attention focus.

Authors:Guanjie Huang, Danny H. K. Tsang, Shan Yang, Guangzhi Lei, Li Liu
Title: Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition
Abstract:
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.
中文: 提出的Cued-Agent系统采用协作多智能体框架解决提示性语音自动识别中的数据限制问题,通过集成手部动作解码、唇部特征提取、多模态融合和语义优化等专用智能体,在正常和听力受损场景下均实现了卓越性能。
English: The proposed Cued-Agent system employs a collaborative multi-agent framework to overcome data limitations in automatic Cued Speech recognition, integrating specialized agents for hand gesture decoding, lip feature extraction, multimodal fusion, and semantic refinement to achieve superior performance in both normal and hearing-impaired scenarios.

Authors:Won June Cho, Hongjun Yoon, Daeky Jeong, Hyeongyeol Lim, Yosep Chong
Title: $MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models
Abstract:
Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce $MV_{Hybrid}$, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare five other different backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, $MV_{Hybrid}$ shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: https://github.com/deepnoid-ai/MVHybrid.
中文摘要:本研究提出的MV_{Hybrid}混合架构结合状态空间模型与视觉Transformer,在从病理图像预测空间基因表达方面显著优于现有模型,同时在多项临床任务中展现出卓越的鲁棒性和性能表现。
English Summary: The study introduces MV_{Hybrid}, a hybrid architecture combining state space models with Vision Transformers, which significantly outperforms existing models in predicting spatial gene expression from pathology images while demonstrating superior robustness and performance across multiple clinical tasks.

Authors:Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
Title: Representation Shift: Unifying Token Compression with FlashAttention
Abstract:
Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
中文: 提出的表征偏移指标通过测量令牌表征变化,实现了与内存优化型FlashAttention兼容的免训练令牌压缩,在视频任务中无需注意力图或重新训练即可获得显著加速效果。
English: The proposed Representation Shift metric enables training-free token compression compatible with memory-efficient FlashAttention by measuring token representation changes, achieving significant speedups in video tasks without attention maps or retraining.
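Since the metric is simply the degree of change in each token's representation, a short sketch makes the pruning step concrete. The L2 change measure and the keep ratio below are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of Representation-Shift-style token pruning: score each token
# by how much its representation changes through a block, then keep the
# highest-scoring tokens. Metric and keep ratio are illustrative assumptions.
import torch

def prune_by_representation_shift(tokens_in, tokens_out, keep_ratio=0.5):
    """tokens_in/tokens_out: (B, N, D) token features before/after a block.
    Returns the kept output tokens and their indices, per batch element."""
    shift = (tokens_out - tokens_in).norm(dim=-1)            # (B, N) per-token change
    num_keep = max(1, int(tokens_in.shape[1] * keep_ratio))
    keep_idx = shift.topk(num_keep, dim=1).indices           # most-changed tokens
    keep_idx, _ = keep_idx.sort(dim=1)                        # preserve original token order
    batch_idx = torch.arange(tokens_in.shape[0]).unsqueeze(1)
    return tokens_out[batch_idx, keep_idx], keep_idx

if __name__ == "__main__":
    x_in, x_out = torch.randn(2, 197, 768), torch.randn(2, 197, 768)
    kept, idx = prune_by_representation_shift(x_in, x_out, keep_ratio=0.25)
    print(kept.shape, idx.shape)   # torch.Size([2, 49, 768]) torch.Size([2, 49])
```

Because the score needs only the token features themselves, no attention map is required, which is what makes the approach compatible with fused kernels such as FlashAttention.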

Authors:Liang Han, Xu Zhang, Haichuan Song, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Title: SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
Abstract:
Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still limited by the limited geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods, which can produce high-quality geometry with sparse-view input, especially in the scenarios with small overlapping views. Project page: https://hanl2010.github.io/SparseRecon/.
中文摘要:SparseRecon提出了一种结合体积渲染特征一致性和不确定性引导深度约束的神经隐式重建方法,有效解决了稀疏视图重建中的几何线索不足问题,在重叠视图较少的场景中实现了最先进的重建质量。
English Summary: SparseRecon introduces a neural implicit method with feature consistency loss and uncertainty-guided depth constraints to overcome limitations in sparse-view 3D reconstruction, achieving superior geometry quality particularly in low-overlap scenarios.
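The two constraints named in the abstract can be illustrated with a short sketch: a cross-view feature consistency term on sampled surface points and a depth term down-weighted where uncertainty is high. The tensor layouts, the cosine form, and the inverse-uncertainty weighting are assumptions, not SparseRecon's exact losses.

```python
# Minimal sketch of a cross-view feature consistency loss and an
# uncertainty-guided depth constraint. Forms and shapes are assumptions.
import torch
import torch.nn.functional as F

def feature_consistency_loss(feat_ref, feat_src):
    """feat_ref/feat_src: (N, C) features sampled from two views at the same
    estimated surface points; penalize their cosine dissimilarity."""
    return (1.0 - F.cosine_similarity(feat_ref, feat_src, dim=-1)).mean()

def uncertainty_guided_depth_loss(pred_depth, prior_depth, uncertainty, eps=1e-3):
    """Weight the depth residual by inverse uncertainty so occluded or
    low-texture regions contribute less."""
    weight = 1.0 / (uncertainty + eps)
    return (weight * (pred_depth - prior_depth).abs()).mean()

if __name__ == "__main__":
    f1, f2 = torch.randn(1024, 32), torch.randn(1024, 32)
    d_pred, d_prior, u = torch.rand(1024), torch.rand(1024), torch.rand(1024)
    total = feature_consistency_loss(f1, f2) + 0.1 * uncertainty_guided_depth_loss(d_pred, d_prior, u)
    print(total.item())
```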

Authors:Tianshuang Qiu, Zehan Ma, Karim El-Refai, Hiya Shah, Chung Min Kim, Justin Kerr, Ken Goldberg
Title: Omni-Scan: Creating Visually-Accurate Digital Twin Object Models Using a Bimanual Robot with Handover and Gaussian Splat Merging
Abstract:
3D Gaussian Splats (3DGSs) are 3D object models derived from multi-view images. Such "digital twins" are useful for simulations, virtual reality, marketing, robot policy fine-tuning, and part inspection. 3D object scanning usually requires multi-camera arrays, precise laser scanners, or robot wrist-mounted cameras, which have restricted workspaces. We propose Omni-Scan, a pipeline for producing high-quality 3D Gaussian Splat models using a bi-manual robot that grasps an object with one gripper and rotates the object with respect to a stationary camera. The object is then re-grasped by a second gripper to expose surfaces that were occluded by the first gripper. We present the Omni-Scan robot pipeline using Depth Anything, Segment Anything, as well as RAFT optical flow models to identify and isolate objects held by a robot gripper while removing the gripper and the background. We then modify the 3DGS training pipeline to support concatenated datasets with gripper occlusion, producing an omni-directional (360 degree view) model of the object. We apply Omni-Scan to part defect inspection, finding that it can identify visual or geometric defects in 12 different industrial and household objects with an average accuracy of 83%. Interactive videos of Omni-Scan 3DGS models can be found at https://berkeleyautomation.github.io/omni-scan/


Authors:Suhang Cai, Xiaohao Peng, Chong Wang, Xiaojie Cai, Jiangbo Qian
Title: GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection
Abstract:
Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on UCF-Crime datasets. The code is available at https://github.com/Sumutan/GV-VAD.git.
中文: 提出的GV-VAD框架通过生成合成视频来增强训练数据,在UCF-Crime数据集上超越了现有最优方法,提升了视频异常检测性能。
English: The proposed GV-VAD framework enhances video anomaly detection by generating synthetic videos to augment training data, outperforming state-of-the-art methods on the UCF-Crime dataset.
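The synthetic sample loss scaling strategy mentioned above can be pictured as a per-sample weight applied to the training objective. The sketch below is a simplified version under a generic weakly-supervised loss; the 0.3 scale and the BCE stand-in objective are assumptions, not GV-VAD's actual formulation.

```python
# Minimal sketch of synthetic-sample loss scaling: real and generated clips share
# one objective, but synthetic clips are down-weighted. Scale and loss are assumptions.
import torch

def scaled_vad_loss(scores, labels, is_synthetic, base_loss_fn, synthetic_scale=0.3):
    """scores/labels: (B,) video-level anomaly scores and weak labels;
    is_synthetic: (B,) bool mask marking generated training videos."""
    per_sample = base_loss_fn(scores, labels)                  # (B,) unreduced loss
    weights = torch.where(is_synthetic,
                          torch.full_like(per_sample, synthetic_scale),
                          torch.ones_like(per_sample))
    return (weights * per_sample).mean()

if __name__ == "__main__":
    bce = torch.nn.BCEWithLogitsLoss(reduction="none")
    s = torch.randn(8)
    y = torch.randint(0, 2, (8,)).float()
    syn = torch.tensor([0, 0, 1, 1, 0, 1, 0, 0], dtype=torch.bool)
    print(scaled_vad_loss(s, y, syn, bce).item())
```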

Authors:Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang
Title: Multimodal Referring Segmentation: A Survey
Abstract:
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.
Chinese: 本文全面综述了多模态指称分割,涵盖其背景、统一架构及在图像、视频和3D场景中的应用方法,同时探讨了应对现实复杂性的策略并提供了性能对比分析。
English: This paper presents a comprehensive survey of multimodal referring segmentation, detailing its background, unified architecture, and methods across images, videos, and 3D scenes, while addressing real-world challenges and providing performance comparisons.

Authors:Bhavya Goyal, Felipe Gutierrez-Barragan, Wei Lin, Andreas Velten, Yin Li, Mohit Gupta
Title: Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs
Abstract:
LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various scene understanding tasks. Modern LiDARs face key challenges in several real-world scenarios, such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines do not retain any uncertainty information from the raw measurements when constructing point clouds. We propose Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (or confidence) in the raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light. Our project webpage is at https://bhavyagoyal.github.io/ppc .
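The core idea of retaining per-point measurement uncertainty can be shown with a small data structure. The dataclass and the confidence-thresholding helper below are illustrative assumptions about how a downstream module might consume such a representation, not the paper's pipeline.

```python
# Minimal sketch of a Probabilistic Point Cloud: each point carries a confidence
# derived from the raw measurement, which downstream modules can weight or filter on.
from dataclasses import dataclass
import numpy as np

@dataclass
class ProbabilisticPointCloud:
    xyz: np.ndarray      # (N, 3) point positions
    prob: np.ndarray     # (N,) per-point measurement confidence in [0, 1]

    def filtered(self, min_prob: float = 0.2) -> "ProbabilisticPointCloud":
        """Drop points whose measurement confidence falls below a threshold."""
        keep = self.prob >= min_prob
        return ProbabilisticPointCloud(self.xyz[keep], self.prob[keep])

if __name__ == "__main__":
    ppc = ProbabilisticPointCloud(xyz=np.random.rand(1000, 3) * 50.0,
                                  prob=np.random.rand(1000))
    print(len(ppc.filtered(0.5).xyz), "of", len(ppc.xyz), "points kept")
```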

Authors:Tomasz Szczepański, Szymon Płotka, Michal K. Grzeszczyk, Arleta Adamowicz, Piotr Fudalej, Przemysław Korzeniowski, Tomasz Trzciński, Arkadiusz Sitek
Title: GEPAR3D: Geometry Prior-Assisted Learning for 3D Tooth Segmentation
Abstract:
Tooth segmentation in Cone-Beam Computed Tomography (CBCT) remains challenging, especially for fine structures like root apices, which is critical for assessing root resorption in orthodontics. We introduce GEPAR3D, a novel approach that unifies instance detection and multi-class segmentation into a single step tailored to improve root segmentation. Our method integrates a Statistical Shape Model of dentition as a geometric prior, capturing anatomical context and morphological consistency without enforcing restrictive adjacency constraints. We leverage a deep watershed method, modeling each tooth as a continuous 3D energy basin encoding voxel distances to boundaries. This instance-aware representation ensures accurate segmentation of narrow, complex root apices. Trained on publicly available CBCT scans from a single center, our method is evaluated on external test sets from two in-house and two public medical centers. GEPAR3D achieves the highest overall segmentation performance, averaging a Dice Similarity Coefficient (DSC) of 95.0% (+2.8% over the second-best method) and increasing recall to 95.2% (+9.5%) across all test sets. Qualitative analyses demonstrated substantial improvements in root segmentation quality, indicating significant potential for more accurate root resorption assessment and enhanced clinical decision-making in orthodontics. We provide the implementation and dataset at https://github.com/tomek1911/GEPAR3D.
中文: GEPAR3D提出了一种结合统计形状模型与深度分水岭算法的统一检测分割方法,在CBCT影像中实现了95.0%的Dice系数,显著提升了牙根尖端分割精度,为正畸治疗中的牙根吸收评估提供了更可靠的解决方案。
English: GEPAR3D introduces a unified deep learning approach combining instance detection and multi-class segmentation with a statistical shape model, achieving superior tooth segmentation performance in CBCT scans with a 95.0% Dice score and significant improvements in root apex delineation for orthodontic applications.

Authors:Li Mi, Manon Bechaz, Zeming Chen, Antoine Bosselut, Devis Tuia
Title: GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration
Abstract:
Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.

Authors:Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicole, Stefano Ghidoni, Nassir Navab
Title: Stress-Aware Resilient Neural Training
Abstract:
This paper introduces Stress-Aware Learning, a resilient neural training paradigm in which deep neural networks dynamically adjust their optimization behavior - whether under stable training regimes or in settings with uncertain dynamics - based on the concept of Temporary (Elastic) and Permanent (Plastic) Deformation, inspired by structural fatigue in materials science. To instantiate this concept, we propose Plastic Deformation Optimizer, a stress-aware mechanism that injects adaptive noise into model parameters whenever an internal stress signal - reflecting stagnation in training loss and accuracy - indicates persistent optimization difficulty. This enables the model to escape sharp minima and converge toward flatter, more generalizable regions of the loss landscape. Experiments across six architectures, four optimizers, and seven vision benchmarks demonstrate improved robustness and generalization with minimal computational overhead. The code and 3D visuals will be available on GitHub: https://github.com/Stress-Aware-Learning/SAL.
中文: 本文提出应力感知学习这一弹性神经训练范式,通过塑性变形优化器向模型参数注入自适应噪声,使模型能够逃离尖锐极小值并收敛至更平坦、泛化能力更强的损失区域,在多种架构和基准测试中展现出卓越的鲁棒性。
English: This paper presents Stress-Aware Learning, a resilient neural training paradigm that uses a Plastic Deformation Optimizer to inject adaptive noise into model parameters, enabling escape from sharp minima and convergence toward flatter, more generalizable loss regions with demonstrated robustness across multiple architectures and benchmarks.
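The stress-triggered noise injection can be sketched as a small training-loop helper: when the loss stops improving over a window (the stress signal), a small perturbation is added to the parameters. The window size, tolerance, and noise scale below are assumptions for illustration, not the Plastic Deformation Optimizer's actual schedule.

```python
# Minimal sketch of stress-aware noise injection: flag stagnation in the loss
# and perturb parameters to help escape sharp minima. Hyperparameters are assumptions.
import torch

class PlasticDeformationHelper:
    def __init__(self, window=5, tol=1e-3, noise_scale=1e-3):
        self.history, self.window, self.tol, self.noise_scale = [], window, tol, noise_scale

    def stressed(self, loss_value: float) -> bool:
        """Stress is flagged when the loss has not improved by `tol` over the window."""
        self.history.append(loss_value)
        if len(self.history) < self.window:
            return False
        recent = self.history[-self.window:]
        return (max(recent) - min(recent)) < self.tol

    @torch.no_grad()
    def maybe_perturb(self, model: torch.nn.Module, loss_value: float):
        if self.stressed(loss_value):
            for p in model.parameters():
                p.add_(torch.randn_like(p) * self.noise_scale)   # small adaptive kick

if __name__ == "__main__":
    model = torch.nn.Linear(10, 2)
    helper = PlasticDeformationHelper()
    for epoch in range(8):
        helper.maybe_perturb(model, loss_value=0.5)   # constant loss triggers perturbation
```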

Authors:Raiyaan Abdullah, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat
Title: Punching Bag vs. Punching Person: Motion Transferability in Videos
Abstract:
Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.
中文摘要:本研究提出一个运动可迁移性框架,评估动作识别模型在新情境下泛化高级运动概念的能力,发现模型性能显著下降,并揭示了时序推理和空间偏差对迁移效果的关键影响。
English Summary: This study introduces a motion transferability framework to evaluate how well action recognition models generalize high-level motion concepts across novel contexts, revealing significant performance drops and highlighting challenges in temporal reasoning and spatial bias.

Authors:Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, Baining Guo
Title: Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
Abstract:
In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
Chinese: 本文提出了一种新颖的视频转4D框架,通过采用在合成数据上训练的高斯变化场扩散模型,从单视频生成动态3D内容,实现了卓越的生成质量和对真实场景输入的泛化能力。
English: This paper introduces a novel video-to-4D framework that generates dynamic 3D content from single videos by employing a Gaussian Variation Field diffusion model trained on synthetic data, achieving superior quality and generalization to real-world inputs.

Authors:Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata
Title: SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions
Abstract:
Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.
Chinese: 概念瓶颈模型(CBMs)在分布变化下难以可靠识别正确概念,为此我们引入了包含38,400张合成图像的SUB基准和捆绑扩散引导方法,以严格评估并推动更稳健可解释模型的发展。
English: Concept Bottleneck Models (CBMs) face challenges in accurately identifying concepts under distribution shifts, prompting the development of the SUB benchmark with 38,400 synthetic images and a Tied Diffusion Guidance method to evaluate and enhance their robustness.
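The noise-sharing idea behind Tied Diffusion Guidance can be sketched as two denoising branches, one prompted with the bird class and one with the substituted attribute, that start from the same latent and are combined at every step. The averaging rule, toy update, and `denoise_step` interface are assumptions, not the paper's exact formulation.

```python
# Minimal, toy sketch of tying two denoising processes via shared noise.
import torch

def tied_denoising(denoise_step, z_T, class_prompt, attr_prompt, timesteps, mix=0.5):
    """denoise_step(z, t, prompt) -> predicted noise; z_T: shared initial latent."""
    z = z_T.clone()
    for t in timesteps:
        eps_class = denoise_step(z, t, class_prompt)      # branch tied to the bird class
        eps_attr = denoise_step(z, t, attr_prompt)        # branch tied to the attribute
        eps = mix * eps_class + (1.0 - mix) * eps_attr    # tie the two processes
        z = z - 0.1 * eps                                  # toy update rule for illustration
    return z

if __name__ == "__main__":
    toy_step = lambda z, t, prompt: 0.01 * z               # stand-in denoiser
    z0 = tied_denoising(toy_step, torch.randn(1, 4, 64, 64),
                        "Scarlet Tanager", "wing color: blue", range(50))
    print(z0.shape)
```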

Authors:Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan
Title: MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Abstract:
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.
Chinese: 本研究提出一种通过对齐独立单目重建来从稀疏视角视频中重构动态场景的方法,在新视角合成方面相比现有技术取得了更优的效果。
English: This study introduces a method for reconstructing dynamic scenes from sparse-view videos by aligning independent monocular reconstructions, achieving superior results over existing techniques especially in novel view synthesis.

Authors:Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo
Title: Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Abstract:
With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment, as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/
中文: Phi-Ground模型系列在计算机使用代理的GUI基础任务中取得了最先进的性能,解决了当前模型在准确执行操作方面的关键需求。
English: The Phi-Ground model family achieves state-of-the-art performance in GUI grounding for computer use agents, addressing the critical need for accurate action execution despite current models' limitations.

Authors:Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, Sara Beery
Title: Consensus-Driven Active Model Selection
Abstract:
The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset -- a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at https://github.com/justinkay/coda.
Chinese Summary: CODA提出了一种主动模型选择方法,利用候选模型间的共识与分歧来优先标注数据,相比现有技术将发现最佳模型所需的标注工作量减少了70%以上。
English Summary: CODA introduces an active model selection method that uses consensus and disagreement among candidate models to prioritize data labeling, significantly reducing annotation effort by over 70% compared to existing approaches.
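The consensus-driven selection loop can be illustrated with a simplified sketch: query the unlabeled point where candidate models disagree most, reveal its label, and update a posterior over each model's accuracy. The Beta-Bernoulli posterior below is a simplification of CODA's probabilistic framework, used only for illustration.

```python
# Simplified sketch of consensus/disagreement-driven active model selection.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_and_update(preds, labels, labeled, alpha, beta):
    """preds: (M, N) class predictions of M models on N points; labels: (N,)
    ground truth revealed lazily; labeled: set of already-queried indices;
    alpha/beta: (M,) Beta posterior parameters over each model's accuracy."""
    disagreement = []
    for i in range(preds.shape[1]):
        if i in labeled:
            disagreement.append(-1.0)
            continue
        _, counts = np.unique(preds[:, i], return_counts=True)
        disagreement.append(entropy(counts / counts.sum()))   # high = models disagree
    query = int(np.argmax(disagreement))
    labeled.add(query)
    correct = (preds[:, query] == labels[query])
    alpha += correct                                           # Bayesian update per model
    beta += ~correct
    return query, alpha, beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds = rng.integers(0, 5, size=(4, 100))    # 4 candidate models, 100 test points
    labels = rng.integers(0, 5, size=100)
    alpha, beta, labeled = np.ones(4), np.ones(4), set()
    for _ in range(10):
        _, alpha, beta = select_and_update(preds, labels, labeled, alpha, beta)
    print("posterior mean accuracy per model:", alpha / (alpha + beta))
```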

Authors:Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen
Title: Slot Attention with Re-Initialization and Self-Distillation
Abstract:
Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; ii) We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available on https://github.com/Genera1Z/DIAS.
中文: 提出的DIAS方法通过重新初始化冗余槽位并实现注意力图的自蒸馏,增强了以对象为中心的学习,在物体发现和识别任务中达到了最先进的性能。
English: The proposed DIAS method enhances Object-Centric Learning by re-initializing redundant slots and implementing self-distillation of attention maps, achieving state-of-the-art performance in object discovery and recognition tasks.
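The self-distillation signal described above can be written as a simple divergence between the first-iteration attention map and the detached last-iteration map. The KL form and the attention-map layout below are assumptions for illustration, not DIAS's exact objective.

```python
# Minimal sketch of attention-map self-distillation between the first and last
# Slot Attention iterations. Loss form and layout are assumptions.
import torch
import torch.nn.functional as F

def attention_self_distillation_loss(attn_first, attn_last, eps=1e-8):
    """attn_first/attn_last: (B, num_slots, num_pixels) attention over inputs,
    normalized across slots per pixel (as in Slot Attention's softmax)."""
    target = attn_last.detach().clamp_min(eps)               # "good" map, no gradient
    student = attn_first.clamp_min(eps)
    return F.kl_div(student.log(), target, reduction="batchmean")

if __name__ == "__main__":
    a_first = torch.softmax(torch.randn(2, 7, 196), dim=1)   # 7 slots over 14x14 features
    a_last = torch.softmax(torch.randn(2, 7, 196), dim=1)
    print(attention_self_distillation_loss(a_first, a_last).item())
```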

Authors:Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jianbing Shen
Title: RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Abstract:
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code is available at https://github.com/wudongming97/AffordanceNet.
中文: 研究人员构建了RAGNet大规模功能分割基准和AffordanceNet框架,通过结合视觉语言模型与功能地图,显著提升了机器人在开放环境中的抓取能力,实验证明其具有优异的泛化性能。
English: Researchers developed RAGNet, a large-scale affordance segmentation benchmark with human-like instructions, and AffordanceNet, a framework that enhances open-world robotic grasping by integrating visual-language models with affordance maps, demonstrating strong generalization in experiments.

Authors:Emery Pierson, Lei Li, Angela Dai, Maks Ovsjanikov
Title: DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
Abstract:
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
中文摘要:本研究首次提出用数据驱动的谱域扩散生成模型替代传统公理化的功能映射正则化方法,实现了无需类别先验的零样本非刚性形状匹配,其性能优于现有公理化方法。
English Summary: This paper introduces a novel data-driven approach that replaces traditional axiomatic regularization in deep functional maps with a category-agnostic generative model trained via spectral domain diffusion, achieving superior zero-shot shape matching performance.

Authors:Yang Gao, Po-Chien Luan, Kaouther Messaoud, Lan Feng, Alexandre Alahi
Title: OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction
Abstract:
While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen datasets with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: https://github.com/vita-epfl/omnitraj
中文: 该研究提出了OmniTraj模型,通过显式的时间元数据调节机制,在人类轨迹预测中实现了卓越的零样本迁移能力,并在多个数据集上取得了最优性能。
English: The study introduces OmniTraj, a Transformer-based model that uses explicit temporal metadata conditioning to achieve superior zero-shot transfer and state-of-the-art performance in human trajectory prediction across diverse datasets.
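The explicit temporal-metadata conditioning is easy to picture as an extra frame-rate embedding added to the trajectory tokens before the Transformer encoder. The layer sizes and fusion-by-addition choice below are assumptions for illustration, not OmniTraj's actual architecture.

```python
# Minimal sketch of frame-rate conditioning for a trajectory encoder.
import torch
import torch.nn as nn

class FrameRateConditionedEncoder(nn.Module):
    def __init__(self, in_dim=2, hidden=128):
        super().__init__()
        self.traj_embed = nn.Linear(in_dim, hidden)          # per-step (x, y) embedding
        self.fps_embed = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, obs_traj, fps):
        """obs_traj: (B, T, 2) observed positions; fps: (B, 1) frame-rate metadata."""
        tokens = self.traj_embed(obs_traj) + self.fps_embed(fps).unsqueeze(1)
        return self.encoder(tokens)                           # (B, T, hidden)

if __name__ == "__main__":
    enc = FrameRateConditionedEncoder()
    out = enc(torch.randn(4, 8, 2), torch.tensor([[2.5], [10.0], [25.0], [30.0]]))
    print(out.shape)   # torch.Size([4, 8, 128])
```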

Authors:Dustin Carrión-Ojeda, Stefan Roth, Simone Schaub-Meyer
Title: Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation
Abstract:
Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

Authors:Xin Li, Keren Fu, Qijun Zhao
Title: Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection
Abstract:
Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.
中文摘要:提出的Vcamba模型通过整合空间与频域特征,结合专门的运动感知模块,在保持计算效率的同时,在多项评估指标上实现了卓越的视频伪装目标检测性能。
English Summary: The proposed Vcamba model integrates spatial and frequency features with specialized motion perception modules to achieve superior video camouflaged object detection performance across multiple metrics while maintaining computational efficiency.

Authors:Zijian Dong, Longteng Duan, Jie Song, Michael J. Black, Andreas Geiger
Title: MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
Abstract:
We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model, that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to limited 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as model inversion by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides an initialization for model fitting, enforces 3D regularization, and helps in refining pose. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable. For code, see https://zj-dong.github.io/MoGA/.

Authors:Yaoye Zhu, Zhe Wang, Yan Wang
Title: MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
Abstract:
As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.
中文: 本文提出MamV2XCalib,首个基于车联网技术的路侧摄像头自动校准方法,通过车载激光雷达数据无需人工干预或特定参照物,在实际车联网场景中展现出更优的校准性能和稳定性。
English: This paper introduces MamV2XCalib, a novel V2X-based automatic calibration method for roadside cameras that utilizes vehicle LiDAR data without requiring manual intervention or reference objects, demonstrating superior performance and robustness in real-world V2X scenarios.

Authors:Woo Kyoung Han, Yongjun Lee, Byeonghun Lee, Sang Hyun Park, Sunghoon Im, Kyong Hwan Jin
Title: JPEG Processing Neural Operator for Backward-Compatible Coding
Abstract:
Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, the JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing source coding's protocol. Our source code is available at https://github.com/WooKyoungHan/JPNeO.
中文: JPNeO是一种新一代JPEG算法,在完全向后兼容的基础上,通过神经算子提升色度保留和重建保真度,同时减少内存占用和参数数量,且无需更改源编码协议。
English: JPNeO is a next-generation JPEG algorithm that ensures full backward compatibility while enhancing chroma preservation and reconstruction fidelity through neural operators, offering reduced memory usage and parameter counts without altering source coding protocols.
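The backward-compatible design can be sketched as a pipeline where a learned operator preprocesses the image before standard JPEG encoding and a second operator restores the decoded result, so the bitstream itself remains a plain JPEG. The two `nn.Identity` placeholders below stand in for the actual neural operators, which are not reproduced here; the rest uses only standard Pillow JPEG encoding/decoding.

```python
# Minimal sketch of a backward-compatible "neural operator around standard JPEG" pipeline.
import io
import numpy as np
import torch
import torch.nn as nn
from PIL import Image

encode_operator = nn.Identity()   # placeholder for the learned pre-encoding operator
decode_operator = nn.Identity()   # placeholder for the learned restoration operator

def compress_decompress(img_u8: np.ndarray, quality: int = 50) -> np.ndarray:
    """img_u8: (H, W, 3) uint8 image. The intermediate bytes are standard JPEG,
    so any existing JPEG decoder can read them."""
    x = torch.from_numpy(img_u8).float() / 255.0
    x = encode_operator(x).clamp(0, 1)                        # learned pre-processing
    buf = io.BytesIO()
    Image.fromarray((x.numpy() * 255).astype(np.uint8)).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    decoded = np.asarray(Image.open(buf).convert("RGB")).astype(np.float32) / 255.0
    y = decode_operator(torch.from_numpy(decoded)).clamp(0, 1)  # learned restoration
    return (y.numpy() * 255).astype(np.uint8)

if __name__ == "__main__":
    demo = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
    print(compress_decompress(demo).shape)
```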

Authors:Mutian Xu, Chongjie Ye, Haolin Liu, Yushuang Wu, Jiahao Chang, Xiaoguang Han
Title: Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion
Abstract:
3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: https://mutianxu.github.io/stable-sim2real/.
Chinese Summary: 本文提出Stable-Sim2Real双阶段深度扩散模型,通过首阶段生成深度残差修正、次阶段优化局部区域,显著提升了3D数据模拟质量,并在真实3D视觉任务中验证了其优越性能。
English Summary: This paper introduces Stable-Sim2Real, a two-stage depth diffusion model that enhances 3D data simulation by first generating residual depth corrections and then refining local regions, demonstrating improved performance in real-world 3D visual tasks.

Authors:Ting Huang, Zeyu Zhang, Hao Tang
Title: 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Abstract:
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
中文: 3D-R1基础模型通过集成思维链合成数据集、带人类反馈的强化学习和动态视角选择策略,显著提升了三维视觉语言模型的推理与泛化能力,在多个三维场景基准测试中平均性能提高10%。
English: The 3D-R1 foundation model enhances 3D vision-language models by incorporating a synthetic dataset with Chain-of-Thought reasoning, reinforcement learning with human feedback, and dynamic view selection, achieving a 10% average improvement across 3D scene benchmarks.
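As a concrete illustration of how the three reward terms described for 3D-R1 (perception, semantic similarity, format) could be combined for GRPO-style training, the sketch below scores a predicted 3D box, an answer string, and the raw output format against references. The weights, the word-overlap stand-in for semantic similarity, and the `<answer>`-tag format check are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of a combined reward in the spirit of 3D-R1.
import re

def iou_3d(a, b):
    """Axis-aligned 3D IoU; boxes are (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        inter *= max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i]))
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0

def semantic_reward(pred_text, ref_text):
    """Word-overlap (Jaccard) as a cheap stand-in for an embedding similarity."""
    p, r = set(pred_text.lower().split()), set(ref_text.lower().split())
    return len(p & r) / len(p | r) if p | r else 0.0

def format_reward(raw_output):
    """1.0 if the answer is wrapped in <answer>...</answer> tags (assumed format)."""
    return 1.0 if re.search(r"<answer>.*?</answer>", raw_output, re.S) else 0.0

def combined_reward(pred_box, gt_box, pred_text, ref_text, raw_output,
                    w_perc=0.5, w_sem=0.4, w_fmt=0.1):
    return (w_perc * iou_3d(pred_box, gt_box)
            + w_sem * semantic_reward(pred_text, ref_text)
            + w_fmt * format_reward(raw_output))

print(combined_reward((0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 1),
                      "a red chair near the window", "red chair by the window",
                      "<answer>a red chair near the window</answer>"))
```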

Authors:Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie
Title: UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries
Abstract:
Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose the UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which explicitly feedback into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.
Chinese: UniEmo框架通过分层情感理解链提取多尺度情感特征,并指导扩散模型生成情感图像,结合双重反馈机制提升理解与生成能力,在实验中显著优于现有先进方法。
English: The UniEmo framework unifies emotional understanding and generation by using a hierarchical chain to extract multi-scale emotional features and guide a diffusion model for image generation, with dual feedback mechanisms enhancing both tasks and achieving state-of-the-art results.

Authors:Ali Youssef
Title: VMatcher: State-Space Semi-Dense Local Feature Matching
Abstract:
This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer's attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher
中文: VMatcher是一种混合Mamba-Transformer网络,通过结合Mamba的线性复杂度高效性和Transformer的注意力机制,实现了半稠密特征匹配,以卓越的效率和鲁棒性为实时应用树立了新标杆。
English: VMatcher is a hybrid Mamba-Transformer network that combines the efficiency of Mamba's linear complexity with Transformer's attention for semi-dense feature matching, achieving state-of-the-art performance with enhanced speed and robustness for real-time applications.

Authors:Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Title: Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
Abstract:
Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.
Chinese: 大型视觉语言模型直接应用自然语言处理的层剪枝方法效果不佳,而Short-LVLM通过聚焦关键视觉语言标记和减小层间特征差异,实现了无需训练的高效性能提升。
English: Large vision-language models face inefficiency with NLP-based layer pruning, but Short-LVLM overcomes this by targeting key tokens and feature gaps to boost performance and efficiency without training.
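A rough sketch of one training-free way to rank layers for pruning by how little they change the hidden states, in the spirit of the layer-redundancy analysis above. The random tensors stand in for real vision-language token features, and the cosine-similarity criterion is an illustrative assumption rather than the exact Short-LVLM procedure.

```python
import torch
import torch.nn.functional as F

def layer_redundancy(hidden_states):
    """hidden_states[i]: (num_tokens, dim) features entering layer i; the last entry
    is the final layer's output. Returns, per layer, the mean cosine similarity
    between its input and output tokens (high = near-identity, more prunable)."""
    sims = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sims.append(F.cosine_similarity(h_in, h_out, dim=-1).mean().item())
    return sims

def layers_to_prune(sims, num_prune):
    # Drop the layers whose outputs are most similar to their inputs.
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:num_prune]

# Toy example: 6 "layers" acting on 128 tokens with 64-dim features.
states = [torch.randn(128, 64)]
for _ in range(6):
    states.append(states[-1] + 0.1 * torch.randn(128, 64))
print(layers_to_prune(layer_redundancy(states), num_prune=2))
```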

Authors:Yingjie Zhou, Jiezhang Cao, Zicheng Zhang, Farong Wen, Yanwei Jiang, Jun Jia, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai
Title: Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads
Abstract:
Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging form of digital human media. However, challenges persist regarding the quality of these talkers and the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset THQA-10K to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis of the subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.
中文摘要:本文提出了迄今最大的AI生成说话头部质量评估数据集THQA-10K,通过主观评估分析了12种文本生成图像模型和14种说话人生成器的表现,并提出基于首帧、Y-T切片和音唇一致性的客观评估方法,实现了最优性能。
English Summary: This paper introduces THQA-10K, the largest quality assessment dataset for AI-generated talking heads, which evaluates 12 text-to-image models and 14 talkers through subjective ratings and proposes an objective assessment method achieving state-of-the-art performance.
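One of the objective cues named above, the Y-T slice, is simply a fixed image column tracked over time so that temporal jitter shows up as texture. A minimal sketch of extracting such a slice from a decoded video tensor; the centre-column choice is an assumption.

```python
import numpy as np

def y_t_slice(video, x=None):
    """video: (T, H, W, C) uint8 frames. Returns an (H, T, C) image whose columns
    are the pixel column at position x from each frame; temporal instability in a
    talking head shows up as horizontal streaks in this slice."""
    t, h, w, c = video.shape
    x = w // 2 if x is None else x
    return np.transpose(video[:, :, x, :], (1, 0, 2))

demo = (np.random.rand(30, 64, 64, 3) * 255).astype(np.uint8)
print(y_t_slice(demo).shape)  # (64, 30, 3)
```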

Authors:Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Sarbajit Pal, Amitabha Das
Title: Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification
Abstract:
Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess suitability for real-time applications, each model is evaluated not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
中文: 本研究评估了超参数调整对七种高效深度学习模型在实时图像分类中性能的影响,发现如余弦学习率衰减等策略可在保持低资源消耗的同时提升准确性和收敛速度。
English: This study evaluates how hyperparameter tuning affects the performance of seven efficient deep learning models for real-time image classification, finding that strategies like cosine learning rate decay enhance accuracy and speed while maintaining low resource use.
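For reference, the cosine learning-rate decay with linear warmup that the study finds beneficial can be written in a few lines; the warmup length and floor learning rate below are arbitrary illustrative values, not the paper's settings.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=1e-5, warmup_steps=500):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 500, 5000, 10000):
    print(s, round(cosine_lr(s, total_steps=10000), 6))
```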

Authors:Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti
Title: The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
Abstract:
Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
中文: 本研究通过交叉注意力分析发现,基于Transformer的文生图扩散模型在没有明确训练指导的情况下,能自发区分艺术创作中的内容与风格——内容词主要聚焦物体区域,而风格词则影响背景纹理,揭示了模型对艺术概念的内在表征机制。
English: This study explores how transformer-based text-to-image diffusion models internally represent artistic concepts like content and style, revealing through cross-attention analysis that they exhibit emergent separation between object-focused content tokens and texture-affecting style tokens without explicit training guidance.
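A minimal numpy sketch of the attribution idea described above: average cross-attention over heads, normalise each prompt token's map, and compare where content-describing versus style-describing tokens place their mass. The per-token normalisation and the simple dominance comparison are assumptions, not the authors' exact pipeline.

```python
import numpy as np

def token_heatmaps(attn, h, w):
    """attn: (heads, h*w, num_tokens) cross-attention weights from one layer/step.
    Returns (h, w, num_tokens) per-token heatmaps, each normalised to sum to 1."""
    maps = attn.mean(axis=0)
    maps = maps / (maps.sum(axis=0, keepdims=True) + 1e-8)
    return maps.reshape(h, w, -1)

def content_dominated_mask(maps, content_ids, style_ids):
    """Boolean mask of pixels where content tokens receive more mass than style tokens."""
    content = maps[..., content_ids].sum(axis=-1)
    style = maps[..., style_ids].sum(axis=-1)
    return content > style

attn = np.random.rand(8, 32 * 32, 10)            # toy attention for a 32x32 latent grid
maps = token_heatmaps(attn, 32, 32)
print(content_dominated_mask(maps, content_ids=[2, 3], style_ids=[6, 7]).mean())
```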

Authors:Xihang Hu, Fuming Sun, Jiazhe Liu, Feilong Xu, Xiaoli Zhang
Title: ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection
Abstract:
Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1\% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.
Chinese: ST-SAM提出了一种自训练框架,通过高置信度伪标签和混合提示利用Segment Anything模型,仅需少量标注数据和单一网络即可实现最先进的半监督伪装目标检测。
English: ST-SAM introduces a self-training framework that uses high-confidence pseudo-labels and hybrid prompts to leverage the Segment Anything Model, achieving state-of-the-art semi-supervised camouflaged object detection with minimal labeled data and a single network.
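The self-training loop described above hinges on two simple operations: keeping only high-confidence pseudo-masks and turning a kept mask into prompts (a box plus a point) for SAM. A hedged sketch, where the confidence threshold and the centroid-point choice are illustrative assumptions:

```python
import numpy as np

def filter_pseudo_masks(prob_maps, conf_thresh=0.9):
    """Keep indices of probability maps whose mean foreground confidence is high."""
    kept = []
    for i, p in enumerate(prob_maps):
        fg = p > 0.5
        if fg.any() and p[fg].mean() >= conf_thresh:
            kept.append(i)
    return kept

def mask_to_prompts(mask):
    """Turn a binary pseudo-mask into a SAM-style box prompt and point prompt."""
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    point = (int(xs.mean()), int(ys.mean()))
    return box, point

probs = [np.random.rand(64, 64) for _ in range(4)]
kept = filter_pseudo_masks(probs, conf_thresh=0.7)
if kept:
    print(mask_to_prompts(probs[kept[0]] > 0.5))
```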

Authors:Hanshen Zhu, Zhen Zhu, Kaile Zhang, Yiming Gong, Yuliang Liu, Xiang Bai
Title: Training-free Geometric Image Editing on Diffusion Models
Abstract:
We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity, and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine
中文摘要:本文提出一种解耦的几何图像编辑流程,通过无需训练的扩散方法FreeFine分别处理对象变换、源区域修复和目标区域优化,在新型GeoBench基准测试中展现出卓越的图像保真度和编辑精度。
English Summary: This paper introduces a decoupled pipeline for geometric image editing that separates object transformation, inpainting, and refinement using a training-free diffusion method called FreeFine, which demonstrates superior performance in image fidelity and edit precision on the new GeoBench benchmark.

Authors:Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim
Title: Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Abstract:
Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
中文摘要:本研究提出的BLiM框架结合候选先验归一化(CPN),通过双向似然估计解决文本-视频检索中的候选先验偏差问题,在多个基准测试中实现了最优性能。
English Summary: The proposed BLiM framework with Candidate Prior Normalization (CPN) addresses candidate prior bias in text-video retrieval by leveraging bidirectional likelihood estimation, achieving state-of-the-art performance across multiple benchmarks.
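The candidate-prior correction described above amounts to subtracting an estimate of each candidate's unconditional log-likelihood from its query-conditioned log-likelihood before ranking. A minimal sketch; the scaling factor alpha and the way the prior is estimated (e.g., likelihood under an empty query) are assumptions for illustration.

```python
import numpy as np

def cpn_calibrated_scores(cond_logp, prior_logp, alpha=1.0):
    """cond_logp: (num_queries, num_candidates) log p(candidate | query).
    prior_logp: (num_candidates,) log p(candidate) estimated without any query.
    Higher calibrated score = better query-candidate match."""
    return cond_logp - alpha * prior_logp[None, :]

cond = np.log(np.array([[0.5, 0.4, 0.1],
                        [0.2, 0.7, 0.1]]))
prior = np.log(np.array([0.6, 0.3, 0.1]))      # candidate 0 has an inflated prior
print(cond.argmax(axis=1))                      # [0 1]: biased toward candidate 0
print(cpn_calibrated_scores(cond, prior).argmax(axis=1))  # [1 1]: bias removed
```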

Authors:Gyeongjin Kang, Seungtae Nam, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park
Title: iLRM: An Iterative Large 3D Reconstruction Model
Abstract:
Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

Authors:Yang Ren, Hai Jiang, Wei Li, Menglong Yang, Heng Zhang, Zehua Sheng, Qingsheng Ye, Shuaicheng Liu
Title: Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction
Abstract:
Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information lossless attribute of wavelet transformation to fulfill arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between the HR and LR domains. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3$\times$, and incorporate it with publicly available datasets with integer factors (2$\times$, 3$\times$, 4$\times$) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.
中文摘要:本文提出了一种基于小波变换的循环重建框架,通过低频任意尺度下采样模块和高频预测模块实现RAW图像的无损下采样,并在包含非整数缩放因子的新数据集上验证了该方法在定量和视觉评估中均优于现有技术。
English Summary: This paper introduces a wavelet-based recurrent reconstruction framework for arbitrary-scale RAW image downscaling that preserves structural and textural integrity through specialized modules and an energy-maximization loss, outperforming existing methods as demonstrated on the newly created Real-NIRD dataset.
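The "information lossless" property the framework builds on comes from the discrete wavelet transform: one 2D DWT level splits a plane into a half-resolution low-frequency band plus three high-frequency bands from which the original can be reconstructed exactly. A tiny PyWavelets sketch of that decomposition; the single synthetic plane and the Haar wavelet are arbitrary choices here, not the paper's configuration.

```python
import numpy as np
import pywt

def wavelet_halve(plane):
    """One DWT level: LL is a coarse half-resolution image; (LH, HL, HH) keep the
    detail needed to invert the transform exactly."""
    ll, highs = pywt.dwt2(plane, "haar")
    return ll, highs

plane = np.random.rand(64, 64)                 # stand-in for one RAW colour plane
ll, highs = wavelet_halve(plane)
recon = pywt.idwt2((ll, highs), "haar")
print(ll.shape, np.allclose(recon, plane))     # (32, 32) True: lossless round trip
```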

Authors:Youngsun Jang, Dongyoun Kim, Chulwoo Pack, Kwanghee Won
Title: A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery
Abstract:
This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be made publicly available.
Chinese: 本研究针对卫星图像中洪水区域分割任务,创建了一个基于2019年美国中西部洪水的新数据集,并通过评估先进模型发现结果一般,表明未来需发展多模态和时间序列学习策略。
English: This study introduces a new dataset for flood area segmentation in satellite imagery, addressing a gap in existing resources by compiling images from the 2019 Midwestern USA floods and evaluating state-of-the-art models, which showed modest results indicating a need for improved multimodal and temporal learning approaches.

Authors:Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu
Title: X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
Abstract:
We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
中文:X-NeMo是一种基于零样本扩散的肖像动画流程,通过提取驱动视频中的一维身份无关潜在运动描述符,能够以不同个体的面部动作驱动静态肖像动画,有效解决了身份泄露问题并精细捕捉各种表情。
English: X-NeMo is a zero-shot diffusion-based portrait animation pipeline that uses a 1D identity-agnostic latent motion descriptor to animate static portraits with facial movements from different individuals, effectively addressing identity leakage and capturing subtle expressions through an end-to-end training framework.

Authors:Dmitry Demidov, Zaigham Zaheer, Omkar Thawakar, Salman Khan, Fahad Shahbaz Khan
Title: Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
Abstract:
Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.
中文摘要:本文提出的免训练E-FineR方法通过结合大语言模型与视觉语言模型,无需预定义标签即可实现最先进的细粒度图像分类,在零样本和少样本场景中展现出卓越的可解释性与性能表现。
English Summary: The proposed training-free E-FineR method combines large language and vision-language models to achieve state-of-the-art fine-grained image classification without predefined labels, offering superior interpretability and performance in zero-shot and few-shot scenarios.

Authors:Giuseppe Cartella, Vittorio Cuculo, Alessandro D'Amelio, Marcella Cornia, Giuseppe Boccignone, Rita Cucchiara
Title: Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
Abstract:
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.
Chinese: ScanDiff提出了一种结合扩散模型与视觉Transformer的新架构,通过文本条件化生成多样且真实的人类注视扫描路径,在自由观看和任务驱动场景下均超越现有最优方法,更好地模拟了人类视觉行为的复杂性。
English: ScanDiff introduces a diffusion-based architecture with Vision Transformers to generate diverse and realistic human gaze scanpaths, outperforming state-of-the-art methods in both free-viewing and task-driven scenarios by capturing variability through textual conditioning.

Authors:Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, Guanghua Yang
Title: Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
Abstract:
We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.

Authors:Ruslan Khrulev
Title: CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam
Abstract:
This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. You can find code in https://github.com/Karifannaa/Auto-check-EGE-math
中文: 本文提出EGE-Math解题评估基准,这一新型评估工具专注于对手写数学解题过程进行评分而非解题本身,通过测试七个先进视觉语言模型揭示了当前在数学推理能力方面的局限。
English: This paper presents the EGE-Math Solutions Assessment Benchmark, a novel evaluation tool for Vision-Language Models that focuses on grading handwritten math solutions rather than solving problems, revealing current limitations in mathematical reasoning through testing seven modern VLMs.

Authors:Harry Shomer, Jiejun Xu
Title: Automated Label Placement on Maps via Large Language Models
Abstract:
Label placement is a critical aspect of map design, serving as a form of spatial annotation that directly impacts clarity and interpretability. Despite its importance, label placement remains largely manual and difficult to scale, as existing automated systems struggle to integrate cartographic conventions, adapt to context, or interpret labeling instructions. In this work, we introduce a new paradigm for automatic label placement (ALP) that formulates the task as a data editing problem and leverages large language models (LLMs) for context-aware spatial annotation. To support this direction, we curate MAPLE, the first known benchmarking dataset for evaluating ALP on real-world maps, encompassing diverse landmark types and label placement annotations from open-source data. Our method retrieves labeling guidelines relevant to each landmark type leveraging retrieval-augmented generation (RAG), integrates them into prompts, and employs instruction-tuned LLMs to generate ideal label coordinates. We evaluate four open-source LLMs on MAPLE, analyzing both overall performance and generalization across different types of landmarks. This includes both zero-shot and instruction-tuned performance. Our results demonstrate that LLMs, when guided by structured prompts and domain-specific retrieval, can learn to perform accurate spatial edits, aligning the generated outputs with expert cartographic standards. Overall, our work presents a scalable framework for AI-assisted map finishing and demonstrates the potential of foundation models in structured data editing tasks. The code and data can be found at https://github.com/HarryShomer/MAPLE.
中文摘要:本研究提出了一种利用大型语言模型结合制图规范的新型自动标签放置方法,通过构建MAPLE基准数据集实现了符合专业标准的地图空间标注。
English Summary: This paper introduces a novel automatic label placement method using large language models guided by cartographic rules, achieving expert-level spatial annotations through a curated benchmark dataset called MAPLE.

Authors:Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Title: EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
Abstract:
Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
中文摘要:医疗大语言模型在眼科诊断中存在因知识不足和多模态数据缺乏导致的幻觉问题,EH-Benchmark通过将幻觉分类为视觉理解与逻辑组合两大类型,并采用包含知识检索与结果验证的三阶段多智能体框架,显著提升了诊断准确性与可靠性。
English Summary: Medical Large Language Models (MLLMs) face accuracy limitations in ophthalmology due to hallucinations from insufficient knowledge and multimodal data, which the proposed EH-Benchmark and multi-agent framework effectively mitigate by categorizing and addressing these errors through knowledge retrieval and validation stages.

Authors:Siqi Luo, Haoran Yang, Yi Xin, Mingyang Yi, Guangyang Wu, Guangtao Zhai, Xiaohong Liu
Title: TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
Abstract:
Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmarks, including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively. The code is available at https://github.com/synbol/TR-PTS.
中文: 提出的TR-PTS框架通过选择性更新任务相关参数和标记来增强视觉模型微调,在基准测试中以更高效率实现了最先进的性能。
English: The proposed TR-PTS framework enhances vision model fine-tuning by selectively updating task-relevant parameters and tokens, achieving state-of-the-art performance with improved efficiency on benchmark datasets.
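A compact sketch of the parameter-selection half of this idea: approximate each parameter's Fisher information by its accumulated squared gradient on a few batches, then keep only the top fraction trainable per tensor. The toy MLP, number of batches, and 10% keep ratio are illustrative assumptions, not the TR-PTS recipe.

```python
import torch
import torch.nn as nn

def fisher_scores(model, batches, loss_fn):
    """Diagonal Fisher approximation: accumulate squared gradients per parameter."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
    return scores

def top_fraction_masks(scores, keep_frac=0.1):
    """Per tensor, mark the highest-Fisher entries as the ones left trainable."""
    masks = {}
    for n, s in scores.items():
        k = max(1, int(keep_frac * s.numel()))
        thresh = s.flatten().topk(k).values.min()
        masks[n] = s >= thresh
    return masks

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
masks = top_fraction_masks(fisher_scores(model, batches, nn.CrossEntropyLoss()))
print({n: int(m.sum()) for n, m in masks.items()})
```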

Authors:Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
Title: ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Abstract:
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
中文: 本文提出了一种模块化的多智能体框架,通过定位、规划和生成三个可解释阶段将用户界面设计转化为前端代码,显著提升了鲁棒性,并在布局准确性和代码质量方面达到了领先水平。
English: This paper introduces a modular multi-agent framework that transforms UI designs into front-end code through three interpretable stages—grounding, planning, and generation—improving robustness and achieving state-of-the-art performance in layout accuracy and code quality.

Authors:Hossein Mirzaei, Zeinab Taghavi, Sepehr Rezaee, Masoud Hadi, Moein Madadi, Mackenzie W. Mathis
Title: DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
Abstract:
Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to Trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion -- reconstructing malicious "shortcut" patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus Trojaned models. DISTIL surpasses alternative methods by high margins, achieving up to 7.1% higher accuracy on the BackdoorBench dataset and a 9.4% improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense without reliance on extensive data or strong prior assumptions about triggers. The code is available at https://github.com/AdaptiveMotorControlLab/DISTIL.
中文: 本文提出DISTIL方法,这是一种无需数据、零样本的触发器反演策略,通过扩散生成器在目标分类器引导下重构后门触发器,无需大量数据或强假设即能有效区分正常与中毒模型,在多项测试中以显著优势超越现有方法。
English: This paper introduces DISTIL, a data-free, zero-shot trigger-inversion method that uses a diffusion-based generator guided by the target classifier to reconstruct Trojan triggers effectively, outperforming existing approaches with significant accuracy improvements in backdoor detection without relying on extensive data or strong trigger assumptions.

Authors:Dongli He, Hu Wang, Mohammad Yaqub
Title: Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings
Abstract:
Accurate fetal biometric measurements, such as abdominal circumference, play a vital role in prenatal care. However, obtaining high-quality ultrasound images for these measurements heavily depends on the expertise of sonographers, posing a significant challenge in low-income countries due to the scarcity of trained personnel. To address this issue, we leverage FetalCLIP, a vision-language model pretrained on a curated dataset of over 210,000 fetal ultrasound image-caption pairs, to perform automated fetal ultrasound image quality assessment (IQA) on blind-sweep ultrasound data. We introduce FetalCLIP$_{CLS}$, an IQA model adapted from FetalCLIP using Low-Rank Adaptation (LoRA), and evaluate it on the ACOUSLIC-AI dataset against six CNN and Transformer baselines. FetalCLIP$_{CLS}$ achieves the highest F1 score of 0.757. Moreover, we show that an adapted segmentation model, when repurposed for classification, further improves performance, achieving an F1 score of 0.771. Our work demonstrates how parameter-efficient fine-tuning of fetal ultrasound foundation models can enable task-specific adaptations, advancing prenatal care in resource-limited settings. The experimental code is available at: https://github.com/donglihe-hub/FetalCLIP-IQA.
中文: 本研究采用FetalCLIP视觉语言模型,通过低秩适应技术实现胎儿超声图像质量自动评估,以0.771的F1分数展现优异性能,为资源有限地区的产前护理提供了有效解决方案。
English: The study adapts the FetalCLIP vision-language foundation model with Low-Rank Adaptation (FetalCLIP_CLS) for automated fetal ultrasound image quality assessment, reaching an F1 score of 0.757, further improved to 0.771 by a repurposed segmentation model, and demonstrating the potential of parameter-efficient fine-tuning to enhance prenatal care in resource-limited areas.
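Low-Rank Adaptation, as used to derive the IQA model from the frozen foundation model, wraps a frozen linear layer with a small trainable rank-r update. A generic sketch; the rank, scaling, and the layer being wrapped are assumptions, not the authors' exact adapter placement.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank residual: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # only A and B are updated
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts at the frozen mapping
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

proj = LoRALinear(nn.Linear(512, 512))    # e.g., one projection inside an image encoder
x = torch.randn(4, 512)
print(proj(x).shape, sum(p.numel() for p in proj.parameters() if p.requires_grad))
```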

Authors:Hang Su, Yunlong Feng, Daniel Gehrig, Panfeng Jiang, Ling Gao, Xavier Lagorce, Laurent Kneip
Title: A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
Abstract:
Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
中文: 本文提出了一种统一方法,可从任意时间戳的二维点对应中估计结构和线性运动,有效适用于全局快门、卷帘快门和事件相机等多种传感器。
English: This paper introduces a unified method for estimating structure and linear motion from 2D point correspondences with arbitrary timestamps, enabling efficient processing across various camera types including global shutter, rolling shutter, and event cameras.

Authors:Thuy Tran, Ruochen Chen, Shaifali Parashar
Title: Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
Abstract:
Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance severely degrades when encountered with severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform 3D template to match input images. In this paper, we propose an unsupervised SfT which uses only image observations: color features, gradients and silhouettes along with a mesh inextensibility constraint to reconstruct at a $400\times$ faster pace than (best-performing) unsupervised SfT. Moreover, when it comes to generating finer details and severe occlusions, our method outperforms the existing methodologies by a large margin. Code is available at https://github.com/dvttran/nsft.
中文: 我们提出的无监督形状模板法仅利用图像特征和网格约束,以比最优方法快400倍的速度重建三维物体,并在处理精细细节和严重遮挡方面远超现有技术。
English: Our unsupervised Shape-from-Template method reconstructs 3D objects 400 times faster than leading approaches by using only image features and mesh constraints, while significantly outperforming existing methods in handling fine details and severe occlusions.
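The mesh inextensibility constraint used above can be written as an edge-length preservation penalty: every template edge should keep its rest length as the mesh deforms. A minimal torch sketch; the quadratic penalty and mean reduction are illustrative choices.

```python
import torch

def inextensibility_loss(verts, edges, rest_lengths):
    """verts: (V, 3) deformed vertex positions; edges: (E, 2) long tensor of vertex
    indices; rest_lengths: (E,) edge lengths measured on the 3D template."""
    d = verts[edges[:, 0]] - verts[edges[:, 1]]
    return ((d.norm(dim=-1) - rest_lengths) ** 2).mean()

# Toy square template with four boundary edges and one diagonal.
template = torch.tensor([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0], [0, 2]])
rest = (template[edges[:, 0]] - template[edges[:, 1]]).norm(dim=-1)
deformed = template + 0.05 * torch.randn_like(template)
print(inextensibility_loss(deformed, edges, rest))
```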

Authors:Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, Marco Cristani
Title: LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
Abstract:
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.

Authors:Shenghao Zhu, Yifei Chen, Weihong Chen, Yuanhan Wang, Chang Liu, Shuo Jiang, Feiwei Qin, Changmiao Wang
Title: Bridging the Gap in Missing Modalities: Leveraging Knowledge Distillation and Style Matching for Brain Tumor Segmentation
Abstract:
Accurate and reliable brain tumor segmentation, particularly when dealing with missing modalities, remains a critical challenge in medical image analysis. Previous studies have not fully resolved the challenges of tumor boundary segmentation insensitivity and feature transfer in the absence of key imaging modalities. In this study, we introduce MST-KDNet, aimed at addressing these critical issues. Our model features Multi-Scale Transformer Knowledge Distillation to effectively capture attention weights at various resolutions, Dual-Mode Logit Distillation to improve the transfer of knowledge, and a Global Style Matching Module that integrates feature matching with adversarial learning. Comprehensive experiments conducted on the BraTS and FeTS 2024 datasets demonstrate that MST-KDNet surpasses current leading methods in both Dice and HD95 scores, particularly in conditions with substantial modality loss. Our approach shows exceptional robustness and generalization potential, making it a promising candidate for real-world clinical applications. Our source code is available at https://github.com/Quanato607/MST-KDNet.
Chinese: MST-KDNet通过多尺度Transformer知识蒸馏和全局风格匹配技术,有效提升了脑肿瘤分割在模态缺失情况下的准确性和鲁棒性,在主流数据集上超越了现有最优方法。
English: MST-KDNet introduces multi-scale transformer knowledge distillation and global style matching to enhance brain tumor segmentation accuracy and robustness, particularly under missing modalities, outperforming state-of-the-art methods on major datasets.
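One of the building blocks named above, logit distillation, is in essence a temperature-scaled KL divergence between teacher and student class distributions. A generic sketch; the temperature and batch-mean reduction are standard choices, not necessarily the exact dual-mode formulation in MST-KDNet.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target KL(teacher || student) on temperature-scaled logits,
    rescaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

student = torch.randn(4, 3, requires_grad=True)   # e.g., per-voxel class logits, flattened
teacher = torch.randn(4, 3)
loss = logit_distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```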

Authors:Takuma Amada, Kazuya Kakizaki, Taiki Miyagawa, Akinori F. Ebihara, Kaede Shiohara, Toshihiko Yamasaki
Title: Robust Deepfake Detection for Electronic Know Your Customer Systems Using Registered Images
Abstract:
In this paper, we present a deepfake detection algorithm specifically designed for electronic Know Your Customer (eKYC) systems. To ensure the reliability of eKYC systems against deepfake attacks, it is essential to develop a robust deepfake detector capable of identifying both face swapping and face reenactment, while also being robust to image degradation. We address these challenges through three key contributions: (1)~Our approach evaluates the video's authenticity by detecting temporal inconsistencies in identity vectors extracted by face recognition models, leading to comprehensive detection of both face swapping and face reenactment. (2)~In addition to processing video input, the algorithm utilizes a registered image (assumed to be genuine) to calculate identity discrepancies between the input video and the registered image, significantly improving detection accuracy. (3)~We find that employing a face feature extractor trained on a larger dataset enhances both detection performance and robustness against image degradation. Our experimental results show that our proposed method accurately detects both face swapping and face reenactment comprehensively and is robust against various forms of unseen image degradation. Our source code is publicly available https://github.com/TaikiMiyagawa/DeepfakeDetection4eKYC.
中文: 本文提出一种面向电子客户识别的深度伪造检测算法,通过分析人脸识别模型提取身份向量的时序不一致性,并利用注册图像计算身份差异,能够全面检测换脸和表情操纵,且对图像质量退化具有强鲁棒性。
English: This paper introduces a deepfake detection algorithm for eKYC systems that identifies face swapping and reenactment by analyzing temporal inconsistencies in identity vectors and leveraging registered images for enhanced accuracy, while demonstrating robustness against image degradation.
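The two detection cues described above, identity drift over time and mismatch with the registered photo, can both be expressed as cosine similarities between identity embeddings. A numpy sketch in which the random embeddings stand in for face-recognition features and the way the two cues are fused into one score is an assumption.

```python
import numpy as np

def suspicion_score(frame_embs, registered_emb):
    """frame_embs: (T, D) identity vectors from a face-recognition model, one per
    frame; registered_emb: (D,) identity vector of the registered (trusted) image.
    Higher score = more deepfake-like."""
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    registered_emb = registered_emb / np.linalg.norm(registered_emb)
    ref_sim = (frame_embs @ registered_emb).mean()                    # match to registered face
    temporal_sim = (frame_embs[1:] * frame_embs[:-1]).sum(1).mean()   # frame-to-frame stability
    return (1.0 - ref_sim) + (1.0 - temporal_sim)

embs = np.random.randn(32, 128)
print(suspicion_score(embs, np.random.randn(128)))
```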

Authors:Galadrielle Humblot-Renaux, Gianni Franchi, Sergio Escalera, Thomas B. Moeslund
Title: COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP
Abstract:
Out-of-distribution (OOD) detection is an important building block in trustworthy image recognition systems as unknown classes may arise at test-time. OOD detection methods typically revolve around a single classifier, leading to a split in the research field between the classical supervised setting (e.g. ResNet18 classifier trained on CIFAR100) vs. the zero-shot setting (class names fed as prompts to CLIP). In both cases, an overarching challenge is that the OOD detection performance is implicitly constrained by the classifier's capabilities on in-distribution (ID) data. In this work, we show that given a little open-mindedness from both ends, remarkable OOD detection can be achieved by instead creating a heterogeneous ensemble - COOkeD combines the predictions of a closed-world classifier trained end-to-end on a specific dataset, a zero-shot CLIP classifier, and a linear probe classifier trained on CLIP image features. While bulky at first sight, this approach is modular, post-hoc and leverages the availability of pre-trained VLMs, thus introduces little overhead compared to training a single standard classifier. We evaluate COOkeD on popular CIFAR100 and ImageNet benchmarks, but also consider more challenging, realistic settings ranging from training-time label noise, to test-time covariate shift, to zero-shot shift which has been previously overlooked. Despite its simplicity, COOkeD achieves state-of-the-art performance and greater robustness compared to both classical and CLIP-based OOD detection methods. Code is available at https://github.com/glhr/COOkeD
中文: COOkeD提出了一种异构集成方法,融合了闭域分类器、零样本CLIP分类器和线性探针分类器,在多种挑战性场景下实现了最先进的分布外检测性能并展现出更强的鲁棒性。
English: COOkeD introduces a heterogeneous ensemble method combining closed-world, zero-shot CLIP, and linear probe classifiers to achieve state-of-the-art OOD detection performance with enhanced robustness across diverse challenging scenarios.
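Because COOkeD is post-hoc, its core step can be reduced to combining the probability vectors of the three classifiers over the shared in-distribution label set and scoring ID-ness from the ensemble, for instance via maximum softmax probability. A sketch under that assumption; the paper's exact combination rule may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ensemble_ood_score(p_closed, p_zeroshot, p_probe):
    """Each input: (num_classes,) softmax probabilities over the ID classes from
    the closed-world classifier, zero-shot CLIP, and the CLIP linear probe.
    Returns the ensemble's max softmax probability (higher = more in-distribution)."""
    p = (p_closed + p_zeroshot + p_probe) / 3.0
    return float(p.max())

rng = np.random.default_rng(0)
print(ensemble_ood_score(softmax(rng.normal(size=100)),
                         softmax(rng.normal(size=100)),
                         softmax(rng.normal(size=100))))
```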

Authors:Shijing Chen, Xinrui Zhou, Yuhao Wang, Yuhao Huang, Ao Chang, Dong Ni, Ruobing Huang
Title: Subtyping Breast Lesions via Generative Augmentation based Long-tailed Recognition in Ultrasound
Abstract:
Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long-tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify data distribution. Inspired by this, we propose a dual-phase framework for long-tailed classification that mitigates distributional bias through high-fidelity data synthesis while avoiding overuse that corrupts holistic performance. The framework incorporates a reinforcement learning-driven adaptive sampler, dynamically calibrating synthetic-real data ratios by training a strategic multi-agent to compensate for scarcities of real data while ensuring stable discriminative capability. Furthermore, our class-controllable synthetic network integrates a sketch-grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation-free inference. Extensive experiments on an in-house long-tailed and a public imbalanced breast US datasets demonstrate that our method achieves promising performance compared to state-of-the-art approaches. More synthetic images can be found at https://github.com/Stinalalala/Breast-LT-GenAug.
Summary: This study introduces a dual-phase framework that uses generative augmentation and reinforcement learning to address the long-tailed distribution challenge in breast ultrasound lesion classification, achieving superior performance through adaptive data synthesis and class-controllable feature preservation.

Authors:Weicheng Gao
Title: Exploration of Low-Cost but Accurate Radar-Based Human Motion Direction Determination
Abstract:
This work was completed on a whim after discussions with my junior colleague. The motion direction angle affects the micro-Doppler spectrum width, thus determining the human motion direction can provide important prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and motion determination simultaneously. In response, a low-cost but accurate radar-based human motion direction determination (HMDD) method is explored in this paper. In detail, the radar-based human gait DTMs are first generated, and then feature augmentation is achieved using a feature linking model. Subsequently, the HMDD is implemented through a lightweight and fast Vision Transformer-Convolutional Neural Network hybrid model structure. The effectiveness of the proposed method is verified on an open-source dataset. The open-source code of this work is released at: https://github.com/JoeyBGOfficial/Low-Cost-Accurate-Radar-Based-Human-Motion-Direction-Determination.
Summary: This paper introduces a low-cost radar-based method that uses a hybrid Vision Transformer-CNN model to enhance micro-Doppler features and accurately determine human motion direction, validated on an open-source dataset.

Authors:Xincheng Yao, Yijun Yang, Kangwei Guo, Ruiqiang Xiao, Haipeng Zhou, Haisu Tao, Jian Yang, Lei Zhu
Title: HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors
Abstract:
The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the dearth of an appropriate dataset and the inherently complex task characteristics, little research has been reported in this domain. To address this issue, we first introduce a high-quality, frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11,442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS. We innovatively embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we design a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at https://github.com/scott-yjyang/HRVVS.
Summary: This study introduces a high-resolution hepatic vasculature segmentation network (HRVVS) and a new annotated dataset, which significantly outperforms existing methods by integrating a pretrained VAR model and a dynamic memory decoder to enhance surgical video analysis.

Authors:Ziyi Wang, Peiming Li, Hong Liu, Zhichao Deng, Can Wang, Jun Liu, Junsong Yuan, Mengyuan Liu
Title: Recognizing Actions from Robotic View for Natural Human-Robot Interaction
Abstract:
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: https://github.com/wangzy01/ACTIVE-Action-from-Robotic-View.
Summary: The ACTIVE dataset is introduced to address the limitations of existing benchmarks in Natural Human-Robot Interaction by providing comprehensive video data with RGB and point cloud modalities, collected from mobile platforms across varying distances and environments.

Authors:Haipeng Li, Tianhao Zhou, Zhanglei Yang, Yi Wu, Yan Chen, Zijing Mao, Shen Cheng, Bing Zeng, Shuaicheng Liu
Title: Estimating 2D Camera Motion with Hybrid Motion Basis
Abstract:
Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. A key insight of our work is that combining flow fields from different homographies creates motion patterns that cannot be represented by any single homography. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: https://lhaippp.github.io/CamFlow/.
Summary: CamFlow introduces a novel camera motion estimation framework using hybrid motion bases and a robust probabilistic loss, outperforming existing methods across diverse scenarios and demonstrating superior generalization in zero-shot settings.
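The key insight, that mixing flows from different homographies yields motion no single homography can represent, can be sketched numerically. The sketch below is illustrative only: the homographies and mixing weights are made up, whereas CamFlow's actual bases include learned stochastic components.

```python
import numpy as np

def homography_flow(H, h, w):
    """Dense 2D flow induced by a 3x3 homography on an h x w pixel grid."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (h, w, 3)
    warped = pts @ H.T
    warped = warped[..., :2] / warped[..., 2:3]
    return warped - pts[..., :2]                          # (h, w, 2) flow

# Hypothetical illustration: a weighted combination of two homography
# flows is generally not itself a homography-induced flow, motivating
# a hybrid motion basis.
H1 = np.array([[1.0, 0.01, 2.0], [0.0, 1.0, -1.0], [1e-4, 0.0, 1.0]])
H2 = np.array([[0.98, 0.0, -3.0], [0.02, 1.0, 0.5], [0.0, 2e-4, 1.0]])
basis = np.stack([homography_flow(H1, 64, 64), homography_flow(H2, 64, 64)])
weights = np.array([0.6, 0.4])                  # per-basis coefficients
hybrid = np.tensordot(weights, basis, axes=1)   # (64, 64, 2) mixed flow
print(hybrid.shape)
```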

Authors:Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mengfei Shi, Xia Xie, Shengyong Chen
Title: LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks
Abstract:
Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.
Summary: The proposed Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR) effectively integrates multimodal crack features through adaptive perception and fusion modules, achieving superior pixel-level segmentation with minimal computational cost compared to existing methods.

Authors:Zheng Xiangyu, He Songcheng, Li Wanyun, Li Xiaoqiang, Zhang Wei
Title: Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
Abstract:
Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently lacks fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory designs relying solely on high-level features, which predominantly capture abstract semantic cues, are insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture that incorporates both shallow- and high-level features for memory, leveraging the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of the Pixel-guided Local Alignment Module (PLAM) and the Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI-Net
Summary: This paper identifies the limitation of existing UVOS methods in over-relying on high-level semantic memory and proposes HMHI-Net, a hierarchical memory architecture with heterogeneous interaction that integrates both shallow and deep features to achieve state-of-the-art performance across benchmarks.

Authors:Jaeha Kim, Junghun Oh, Kyoung Mu Lee
Title: Exploiting Diffusion Prior for Task-driven Image Restoration
Abstract:
Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This motivates us to leverage the diffusion prior, one of the most powerful natural image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with recent TDIR methods. To address this, we propose EDTR, which effectively harnesses the power of the diffusion prior to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pixel-error-based pre-restored LQ images with mild noise added. Moreover, we employ a small number of denoising steps to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes the diffusion prior for TDIR, significantly enhancing task performance and visual quality across diverse tasks with multiple complex degradations.
Summary: The proposed EDTR method effectively leverages the diffusion prior to restore task-relevant details in images degraded by multiple complex factors, enhancing both task performance and visual quality through pixel-error-based pre-restoration and controlled denoising steps.
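A minimal sketch of the stated mechanism, assuming a generic pretrained denoiser: start the diffusion from the pre-restored image with mild noise rather than pure noise, and take only a few solver steps so task-relevant details are not overwritten. The noise scale, start timestep, and the `denoise_step` interface are assumptions, not the paper's API.

```python
import torch

@torch.no_grad()
def edtr_style_restore(pre_restored, denoise_step, t_start=200, n_steps=8):
    """Hedged sketch of the EDTR idea: partial diffusion from a
    pixel-error-based pre-restored image with mild added noise,
    followed by a small number of denoising steps."""
    x = pre_restored + 0.1 * torch.randn_like(pre_restored)  # mild noise
    ts = torch.linspace(t_start, 0, n_steps + 1).long()
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        x = denoise_step(x, int(t_hi), int(t_lo))  # one solver update
    return x

# dummy denoiser standing in for a pretrained diffusion model
identity_step = lambda x, t_hi, t_lo: x
out = edtr_style_restore(torch.rand(1, 3, 64, 64), identity_step)
print(out.shape)
```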

Authors:Jiuming Liu, Zheng Huang, Mengmeng Liu, Tianchen Deng, Francesco Nex, Hao Cheng, Hesheng Wang
Title: TopoLiDM: Topology-Aware LiDAR Diffusion Models for Interpretable and Realistic LiDAR Point Cloud Generation
Abstract:
LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their ability to interpretably model detailed geometric structures and preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topology-preserving VAE to extract latent graph representations through graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM's superiority over state-of-the-art methods, achieving improvements of 22.6% lower Frechet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation, at an average speed of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related codes at https://github.com/IRMVLab/TopoLiDM.
Summary: TopoLiDM is a novel framework that integrates graph neural networks with diffusion models under topological regularization to generate high-fidelity LiDAR scenes, achieving superior geometric realism and global topological consistency compared to existing methods.

Authors:Youngho Kim, Hoonhee Cho, Kuk-Jin Yoon
Title: From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
Abstract:
Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at https://github.com/kmax2001/EvSharp2Blur.
Summary: This study introduces a domain adaptation method using event cameras to generate motion-aware blurred images and a student-teacher framework with mutual uncertainty masking, achieving robust human pose estimation under motion blur without target domain annotations.
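A common way to realize the described motion-aware blur synthesis is temporal averaging of sharp frames; the sketch below shows that simplified view (the frames stand in for densely interpolated, event-aligned frames), not the paper's exact pipeline.

```python
import torch

def synthesize_motion_blur(sharp_frames):
    """Toy sketch: temporally average a burst of sharp frames to emulate
    exposure integration and produce a pseudo-blurred training image,
    bridging the sharp-to-blurred domain gap without paired labels.
    The real method conditions this on event-camera data.
    sharp_frames: (T, C, H, W) tensor."""
    return sharp_frames.mean(dim=0)

burst = torch.rand(9, 3, 64, 64)          # toy burst of sharp frames
blurred = synthesize_motion_blur(burst)   # pseudo-blurred image
print(blurred.shape)
```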

Authors:Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran
Title: Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching
Abstract:
Quantifying aleatoric uncertainty in medical image segmentation is critical since it reflects the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution with a generative model, but current methods limit the expressive ability of generative models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at https://github.com/huynhspm/Data-Uncertainty
Summary: This study introduces a method using conditional flow matching to accurately quantify aleatoric uncertainty in medical image segmentation by generating multiple segmentation samples that reflect inter-annotator variability, achieving both competitive accuracy and insightful uncertainty maps.
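The sampling strategy can be sketched as follows: Euler-integrate the learned conditional velocity field from noise several times, then take the pixel-wise variance across samples as the uncertainty map. The `velocity_net` below is a toy placeholder for the trained network, and the integration schedule is an assumption.

```python
import torch

@torch.no_grad()
def aleatoric_uncertainty(velocity_net, image, n_samples=8, n_steps=50):
    """Sketch: draw several segmentations from a conditional flow
    matching model by integrating dx/dt = v(x, t | image) from noise,
    then report the per-pixel mean and variance."""
    samples = []
    for _ in range(n_samples):
        x = torch.randn_like(image[:, :1])           # one-channel mask
        for i in range(n_steps):
            t = torch.full((x.shape[0],), i / n_steps)
            x = x + velocity_net(x, t, image) / n_steps  # Euler step
        samples.append(x)
    stack = torch.stack(samples)                     # (N, B, 1, H, W)
    return stack.mean(0), stack.var(0)               # prediction, uncertainty

toy_velocity = lambda x, t, img: -x                  # dummy velocity field
mean, var = aleatoric_uncertainty(toy_velocity, torch.rand(2, 3, 32, 32))
print(mean.shape, var.shape)
```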

Authors:Sijie Wang, Siqi Li, Yawei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, Shrivarshann Srijeyan, Aditya Sivadas, Toshan Aggarwal, Heyuan Liu, Hongming Zhang, Chujie Chen, Junyu Jiang, Lihua Xie, Wee Peng Tay
Title: UAVScenes: A Multi-Modal Dataset for UAVs
Abstract:
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes
Summary: UAVScenes is a new large-scale multi-modal dataset that enhances the MARS-LVIG dataset with detailed semantic annotations and 6-DoF poses, enabling advanced UAV perception tasks like segmentation and 3D scene understanding.

Authors:Seungryong Lee, Woojeong Baek, Younghyun Kim, Eunwoo Kim, Haru Moon, Donggon Yoo, Eunbyung Park
Title: Moiré Zero: An Efficient and High-Performance Neural Architecture for Moiré Removal
Abstract:
Moiré patterns, caused by frequency aliasing between fine repetitive structures and a camera sensor's sampling process, have been a significant obstacle in various real-world applications, such as consumer photography and industrial defect inspection. With the advancements in deep learning algorithms, numerous studies, predominantly based on convolutional neural networks, have suggested various solutions to address this issue. Despite these efforts, existing approaches still struggle to effectively eliminate artifacts due to the diverse scales, orientations, and color shifts of moiré patterns, primarily because the constrained receptive field of CNN-based architectures limits their ability to capture the complex characteristics of moiré patterns. In this paper, we propose MZNet, a U-shaped network designed to bring images closer to a 'Moiré-Zero' state by effectively removing moiré patterns. It integrates three specialized components: a Multi-Scale Dual Attention Block (MSDAB) for extracting and refining multi-scale features, a Multi-Shape Large Kernel Convolution Block (MSLKB) for capturing diverse moiré structures, and a Feature Fusion-Based Skip Connection for enhancing information flow. Together, these components enhance local texture restoration and large-scale artifact suppression. Experiments on benchmark datasets demonstrate that MZNet achieves state-of-the-art performance on high-resolution datasets and delivers competitive results on lower-resolution datasets, while maintaining a low computational cost, suggesting that it is an efficient and practical solution for real-world applications. Project page: https://sngryonglee.github.io/MoireZero

Authors:Anubhav Kataria, Surbhi Madan, Shreya Ghosh, Tom Gedeon, Abhinav Dhall
Title: Gems: Group Emotion Profiling Through Multimodal Situational Understanding
Abstract:
Understanding individual, group, and event-level emotions, along with contextual information, is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting emotions at levels ranging from fine-grained individual emotion to coarse-grained group- and event-level emotion. We introduce GEMS, which leverages a multimodal Swin Transformer and S3Attention-based architecture that processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion benchmarks mainly focus on atomic interactions, primarily based on emotion perception over time and at the group level. To this end, we extend and propose VGAF-GEMS to provide more fine-grained and holistic analysis on top of the existing group-level annotation of the VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group, and event-level perceived emotions. Our benchmarking effort links individual, group, and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of the GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way for further research. The code and data are available at: https://github.com/katariaak579/GEMS
Summary: The GEMS framework utilizes multimodal transformers and attention mechanisms to predict fine-grained individual, group, and event-level emotions, demonstrating superior performance on the extended VGAF-GEMS benchmark for holistic social emotion analysis.

Authors:Pei Deng, Wenqian Zhou, Hanlin Wu
Title: DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception
Abstract:
Accurate interpretation of land-cover changes in multi-temporal satellite imagery is critical for real-world scenarios. However, existing methods typically provide only one-shot change masks or static captions, limiting their ability to support interactive, query-driven analysis. In this work, we introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering to enable multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images. To support this task, we construct ChangeChat-105k, a large-scale instruction-following dataset, generated through a hybrid rule-based and GPT-assisted process, covering six interaction types: change captioning, classification, quantification, localization, open-ended question answering, and multi-turn dialogues. Building on this dataset, we propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information from visual changes, aligning them with textual instructions. We train DeltaVLM on ChangeChat-105k using a frozen large language model, adapting only the vision and alignment modules to optimize efficiency. Extensive experiments and ablation studies demonstrate that DeltaVLM achieves state-of-the-art performance on both single-turn captioning and multi-turn interactive change analysis, outperforming existing multimodal large language models and remote sensing vision-language models. Code, dataset and pre-trained weights are available at https://github.com/hanlinwu/DeltaVLM.
Summary: This paper introduces RSICA, a new paradigm combining change detection and visual question answering to enable interactive analysis of bi-temporal remote sensing images, supported by the DeltaVLM model and ChangeChat-105k dataset, achieving state-of-the-art performance in multi-turn change exploration.

Authors:Yuki Fujimura, Takahiro Kushida, Kazuya Kitano, Takuya Funatomi, Yasuhiro Mukaigawa
Title: UFV-Splatter: Pose-Free Feed-Forward 3D Gaussian Splatting Adapted to Unfavorable Views
Abstract:
This paper presents a pose-free, feed-forward 3D Gaussian Splatting (3DGS) framework designed to handle unfavorable input views. A common rendering setup for training feed-forward approaches places a 3D object at the world origin and renders it from cameras pointed toward the origin -- i.e., from favorable views, limiting the applicability of these models to real-world scenarios involving varying and unknown camera poses. To overcome this limitation, we introduce a novel adaptation framework that enables pretrained pose-free feed-forward 3DGS models to handle unfavorable views. We leverage priors learned from favorable images by feeding recentered images into a pretrained model augmented with low-rank adaptation (LoRA) layers. We further propose a Gaussian adapter module to enhance the geometric consistency of the Gaussians derived from the recentered inputs, along with a Gaussian alignment method to render accurate target views for training. Additionally, we introduce a new training strategy that utilizes an off-the-shelf dataset composed solely of favorable images. Experimental results on both synthetic images from the Google Scanned Objects dataset and real images from the OmniObject3D dataset validate the effectiveness of our method in handling unfavorable input views.
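The abstract's LoRA-based adaptation can be illustrated with a standard low-rank adapter wrapped around a frozen linear layer of the pretrained model; the rank and scaling values here are generic defaults, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard low-rank adaptation of a frozen linear layer, the kind
    of layer the abstract describes adding to a pretrained feed-forward
    3DGS model. Hyperparameters are illustrative assumptions."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # start as identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(4, 256)).shape)
```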

Authors:Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang
Title: SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Abstract:
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts, ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only preserve cross-modal semantic information in its entirety but also disentangle visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at https://github.com/Mid-Push/SmartCLIP.
Summary: CLIP faces challenges with information misalignment and entangled representations in image-text datasets; the proposed SmartCLIP framework enables flexible cross-modal alignment and disentangled visual concepts to improve generalization.

Authors:Umair Nawaz, Muhammad Zaigham Zaheer, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock
Abstract:
Crops, fisheries and livestock form the backbone of global food production, essential to feed the ever-growing global population. However, these sectors face considerable challenges, including climate variability, resource limitations, and the need for sustainable management. Addressing these issues requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey presents a systematic and thorough review of more than 200 research works covering conventional machine learning approaches, advanced deep learning techniques (e.g., vision transformers), and recent vision-language foundation models (e.g., CLIP) in the agriculture domain, focusing on diverse tasks such as crop disease detection, livestock health management, and aquatic species monitoring. We further cover major implementation challenges such as data variability and experimental aspects: datasets, performance evaluation metrics, and geographical focus. We finish the survey by discussing potential open research directions emphasizing the need for multimodal data integration, efficient edge-device deployment, and domain-adaptable AI models for diverse farming environments. Rapid growth of evolving developments in this field can be actively tracked on our project page: https://github.com/umair1221/AI-in-Agriculture
Summary: This survey systematically reviews over 200 AI applications in agriculture, from traditional machine learning to advanced vision-language models, covering tasks such as crop disease detection and livestock monitoring while highlighting implementation hurdles and future research directions.

Authors:Sicheng Zhang, Binzhu Xie, Zhonghao Yan, Yuli Zhang, Donghao Zhou, Xiaofei Chen, Shi Qiu, Jiaqi Liu, Guoyang Xie, Zhichao Lu
Title: Trade-offs in Image Generation: How Do Different Dimensions Interact?
Abstract:
Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models' complex trade-offs among these dimensions have rarely been explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To bridge this gap, we introduce TRIG-Bench (Trade-offs in Image Generation), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity, and Bias), contains 40,200 samples, and covers 132 pairwise dimensional subsets. Furthermore, we develop TRIGScore, a VLM-as-judge metric that automatically adapts to various dimensions. Based on TRIG-Bench and TRIGScore, we evaluate 14 models across T2I and I2I tasks. In addition, we propose the Relation Recognition System to generate the Dimension Trade-off Map (DTM) that visualizes the trade-offs among model-specific capabilities. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, we show that the model's dimension-specific weaknesses can be mitigated through fine-tuning on DTM to enhance overall performance. Code is available at: https://github.com/fesvhtr/TRIG
Summary: TRIG-Bench and TRIGScore are introduced to evaluate trade-offs across ten dimensions in text-to-image and image-to-image generation, enabling comprehensive model analysis and performance enhancement through fine-tuning.

Authors:Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
Title: UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Abstract:
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE to enhance GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art grounding performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2, while also exhibiting strong general agent capabilities. For instance, using both our training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. The code is available at https://github.com/KDEGroup/UI-AGILE.
Summary: UI-AGILE introduces training enhancements like continuous rewards and cropping-based resampling, along with inference improvements through decomposed grounding, achieving state-of-the-art GUI agent performance on benchmarks.
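One plausible form of the continuous grounding reward (the abstract does not give the formula, so this is a hypothetical sketch): full reward for a click inside the target element and a smooth distance-based falloff outside, so near-misses still carry learning signal instead of a flat zero.

```python
import math

def grounding_reward(click_xy, bbox):
    """Hypothetical continuous reward in the spirit of UI-AGILE's
    high-precision grounding incentive; the paper's exact functional
    form may differ. bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    if x0 <= click_xy[0] <= x1 and y0 <= click_xy[1] <= y1:
        return 1.0                                    # inside the target
    d = math.hypot(click_xy[0] - cx, click_xy[1] - cy)
    diag = math.hypot(x1 - x0, y1 - y0)
    return max(0.0, 1.0 - d / (2 * diag))             # smooth falloff

print(grounding_reward((105, 52), (90, 40, 120, 60)))  # inside -> 1.0
print(grounding_reward((130, 52), (90, 40, 120, 60)))  # near miss -> ~0.65
```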

Authors:Ziyun Dai, Xiaoqiang Li, Shaohua Zhang, Yuanchen Wu, Jide Li
Title: See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucination phenomena, manifesting as generated textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially when confronted with fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models' fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.
Summary: ViHallu is a vision-centric framework that mitigates hallucinations in Large Vision-Language Models by generating visual variation images and constructing visual instructions, thereby enhancing fine-grained visual understanding and alignment between visual content and text.

Authors:Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang
Title: Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition
Abstract:
Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.
Summary: The proposed Motion-guided Modulation Network (MMN) effectively captures subtle motion cues through skeletal and temporal modulation modules to improve micro-action recognition, achieving state-of-the-art results on benchmark datasets.

Authors:Jiahui Ren, Mochu Xiang, Jiajun Zhu, Yuchao Dai
Title: PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction
Abstract:
Wide-baseline panorama reconstruction has emerged as a highly effective and pivotal approach for not only achieving geometric reconstruction of the surrounding 3D environment, but also generating highly realistic and immersive novel views. Although existing methods have shown remarkable performance across various benchmarks, they are predominantly reliant on accurate pose information. In real-world scenarios, the acquisition of precise pose often requires additional computational resources and is highly susceptible to noise. These limitations hinder the broad applicability and practicality of such methods. In this paper, we present PanoSplatt3R, an unposed wide-baseline panorama reconstruction method. We extend and adapt foundational reconstruction pretraining from the perspective domain to the panoramic domain, thus enabling powerful generalization capabilities. To ensure a seamless and efficient domain-transfer process, we introduce RoPE rolling that spans rolled coordinates in rotary positional embeddings across different attention heads, maintaining a minimal modification to RoPE's mechanism, while modeling the horizontal periodicity of panorama images. Comprehensive experiments demonstrate that PanoSplatt3R, even in the absence of pose information, significantly outperforms current state-of-the-art methods. This superiority is evident in both the generation of high-quality novel views and the accuracy of depth estimation, thereby showcasing its great potential for practical applications. Project page: https://npucvr.github.io/PanoSplatt3R
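A rough sketch of the RoPE rolling idea as described: each attention head applies rotary embeddings to horizontal positions cyclically shifted by a per-head offset modulo the panorama width, so the rotary phase respects the image's horizontal periodicity. The offset schedule and tensor shapes below are assumptions, not the paper's implementation.

```python
import torch

def rope_rolling(q, width, head_offsets):
    """Apply rotary embeddings with per-head rolled positions.
    q: (heads, width, dim) queries along the horizontal axis."""
    heads, w, dim = q.shape
    pos = torch.arange(w).float()                     # base positions
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    out = torch.empty_like(q)
    for h in range(heads):
        rolled = (pos + head_offsets[h]) % width      # cyclic roll per head
        ang = rolled[:, None] * inv_freq[None, :]     # (width, dim/2)
        cos, sin = ang.cos(), ang.sin()
        q1, q2 = q[h, :, 0::2], q[h, :, 1::2]
        out[h, :, 0::2] = q1 * cos - q2 * sin         # rotate feature pairs
        out[h, :, 1::2] = q1 * sin + q2 * cos
    return out

q = torch.randn(4, 16, 8)
print(rope_rolling(q, 16, [0, 4, 8, 12]).shape)
```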

Authors:Tianhong Gao, Yannian Fu, Weiqun Wu, Haixiao Yue, Shanshan Liu, Gang Zhang
Title: MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
Abstract:
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M.
Summary: Researchers developed MMAT-1M, a million-scale multimodal agent tuning dataset featuring Chain-of-Thought reasoning, reflection, and tool usage, which significantly boosts the performance of multimodal models on benchmarks.

Authors:Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Title: ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
Abstract:
Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, which is the common case in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
Summary: ArtSeek is a multimodal framework that analyzes artworks using only image inputs, integrating retrieval-augmented generation and achieving state-of-the-art results by leveraging a Wikipedia-scale dataset for enhanced visual and contextual understanding.
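Late interaction retrieval of the kind the abstract names is typically scored with ColBERT-style MaxSim; a minimal sketch follows, with random embeddings standing in for the actual query and fragment encoders.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim late-interaction scoring: for each query token, take its
    best cosine similarity over document tokens, then sum.
    query_tokens: (Nq, D), doc_tokens: (Nd, D), both L2-normalized."""
    sim = query_tokens @ doc_tokens.T           # (Nq, Nd) cosine sims
    return sim.max(dim=1).values.sum().item()   # best match per query token

q = F.normalize(torch.randn(8, 128), dim=-1)    # toy query embeddings
d = F.normalize(torch.randn(50, 128), dim=-1)   # toy fragment embeddings
print(late_interaction_score(q, d))
```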

Authors:Shengjia Chen, Ruchika Verma, Kevin Clare, Jannes Jegminat, Eugenia Alleva, Kuan-lin Huang, Brandon Veremis, Thomas Fuchs, Gabriele Campanella
Title: Predict Patient Self-reported Race from Skin Histological Images
Abstract:
Artificial Intelligence (AI) has demonstrated success in computational pathology (CPath) for disease detection, biomarker classification, and prognosis prediction. However, its potential to learn unintended demographic biases, particularly those related to social determinants of health, remains understudied. This study investigates whether deep learning models can predict self-reported race from digitized dermatopathology slides and identifies potential morphological shortcuts. Using a multisite dataset with a racially diverse population, we apply an attention-based mechanism to uncover race-associated morphological features. After evaluating three dataset curation strategies to control for confounding factors, the final experiment showed that White and Black demographic groups retained high prediction performance (AUC: 0.799, 0.762), while overall performance dropped to 0.663. Attention analysis revealed the epidermis as a key predictive feature, with significant performance declines when these regions were removed. These findings highlight the need for careful data curation and bias mitigation to ensure equitable AI deployment in pathology. Code available at: https://github.com/sinai-computational-pathology/CPath_SAIF.
Summary: This study demonstrates that deep learning models can predict patient race from pathology slides with notable accuracy, revealing unintended bias through epidermal features and underscoring the need for improved data curation to ensure equitable AI in healthcare.

Authors:Viacheslav Pirogov, Maksim Artemev
Title: Evaluating Deepfake Detectors in the Wild
Abstract:
Deepfakes powered by advanced machine learning models present a significant and evolving threat to identity verification and the authenticity of digital media. Although numerous detectors have been developed to address this problem, their effectiveness has yet to be tested when applied to real-world data. In this work we evaluate modern deepfake detectors, introducing a novel testing procedure designed to mimic real-world scenarios for deepfake detection. Using state-of-the-art deepfake generation methods, we create a comprehensive dataset containing more than 500,000 high-quality deepfake images. Our analysis shows that detecting deepfakes still remains a challenging task. The evaluation shows that fewer than half of the deepfake detectors tested achieved an AUC score greater than 60%, with the lowest being 50%. We demonstrate that basic image manipulations, such as JPEG compression or image enhancement, can significantly reduce model performance. All code and data are publicly available at https://github.com/SumSubstance/Deepfake-Detectors-in-the-Wild.
Summary: Modern deepfake detectors struggle in real-world scenarios, with fewer than half achieving over 60% AUC and basic image manipulations significantly degrading their performance.
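The compression-robustness finding suggests an evaluation harness along these lines: score each image before and after JPEG re-encoding and compare the resulting AUCs. This is a sketch of the protocol, not the paper's code; the toy detector and random images are stand-ins.

```python
import io
import numpy as np
from PIL import Image
from sklearn.metrics import roc_auc_score

def jpeg_perturb(img, quality=60):
    """Re-encode an image at a given JPEG quality, one of the basic
    manipulations the study found degrades detector performance."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def auc_before_after(detector, images, labels, quality=60):
    """Compare detector AUC on clean vs. JPEG-compressed images."""
    clean = [detector(im) for im in images]
    compressed = [detector(jpeg_perturb(im, quality)) for im in images]
    return roc_auc_score(labels, clean), roc_auc_score(labels, compressed)

# toy run: random images and a stand-in "detector"
imgs = [Image.fromarray((np.random.rand(64, 64, 3) * 255).astype("uint8"))
        for _ in range(8)]
labels = [0, 1] * 4
toy_detector = lambda im: float(np.asarray(im).mean()) / 255.0
print(auc_before_after(toy_detector, imgs, labels))
```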

Authors:Julia Wolleb, Florentin Bieder, Paul Friedrich, Hemant D. Tagare, Xenophon Papademetris
Title: VidFuncta: Towards Generalizable Neural Representations for Ultrasound Videos
Abstract:
Ultrasound is widely used in clinical care, yet standard deep learning methods often struggle with full video analysis due to non-standardized acquisition and operator bias. We offer a new perspective on ultrasound video analysis through implicit neural representations (INRs). We build on Functa, an INR framework in which each image is represented by a modulation vector that conditions a shared neural network. However, its extension to the temporal domain of medical videos remains unexplored. To address this gap, we propose VidFuncta, a novel framework that leverages Functa to encode variable-length ultrasound videos into compact, time-resolved representations. VidFuncta disentangles each video into a static video-specific vector and a sequence of time-dependent modulation vectors, capturing both temporal dynamics and dataset-level redundancies. Our method outperforms 2D and 3D baselines on video reconstruction and enables downstream tasks to directly operate on the learned 1D modulation vectors. We validate VidFuncta on three public ultrasound video datasets -- cardiac, lung, and breast -- and evaluate its downstream performance on ejection fraction prediction, B-line detection, and breast lesion classification. These results highlight the potential of VidFuncta as a generalizable and efficient representation framework for ultrasound videos. Our code is publicly available under https://github.com/JuliaWolleb/VidFuncta_public.
Summary: This paper introduces VidFuncta, a novel framework using implicit neural representations to encode ultrasound videos into compact time-resolved vectors, outperforming traditional methods in reconstruction and enabling efficient downstream task performance across multiple clinical applications.

Authors:Jiahao He, Daerji Suolang, Keren Fu, Qijun Zhao
Title: Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection
Abstract:
Applying salient object detection (SOD) to RGB-D videos is an emerging task called RGB-D VSOD, which has recently gained increasing interest due to the considerable performance gains from incorporating motion and depth, and because RGB-D videos can now be easily captured in daily life. Existing RGB-D VSOD models make different attempts to derive motion cues, among which extracting motion information explicitly from optical flow appears to be a more effective and promising alternative. Despite this, a key issue remains: how to effectively utilize optical flow and depth to assist the RGB modality in SOD. Previous methods always treat optical flow and depth equally with respect to model designs, without explicitly considering their unequal contributions in individual scenarios, limiting the potential of motion and depth. To address this issue and unleash the power of motion and depth, we propose a novel selective cross-modal fusion framework (SMFNet) for RGB-D VSOD, incorporating a pixel-level selective fusion strategy (PSF) that achieves optimal fusion of optical flow and depth based on their actual contributions. Besides, we propose a multi-dimensional selective attention module (MSAM) to integrate the fused features derived from PSF with the remaining RGB modality at multiple dimensions, effectively enhancing feature representation to generate refined features. We conduct a comprehensive evaluation of SMFNet against 19 state-of-the-art models on both the RDVS and DVisal datasets, making the evaluation the most comprehensive RGB-D VSOD benchmark to date and demonstrating the superiority of SMFNet over other models. Meanwhile, evaluation on five video benchmark datasets incorporating synthetic depth validates the efficacy of SMFNet as well. Our code and benchmark results are made publicly available at https://github.com/Jia-hao999/SMFNet.
Summary: The study introduces SMFNet, a selective cross-modal fusion framework for RGB-D video salient object detection that dynamically integrates optical flow and depth based on their actual contributions, achieving superior performance over existing models on comprehensive benchmarks.
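The pixel-level selective fusion idea can be sketched as a learned per-pixel gate over the two modalities. This is a simplified reading of PSF; the channel sizes and gate design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PixelSelectiveFusion(nn.Module):
    """Sketch of a PSF-like module: a 1x1-conv gate predicts, per pixel,
    how much to trust optical-flow features vs. depth features, rather
    than weighting the two modalities equally everywhere."""
    def __init__(self, ch=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, 1, 1), nn.Sigmoid())

    def forward(self, flow_feat, depth_feat):
        w = self.gate(torch.cat([flow_feat, depth_feat], dim=1))  # (B,1,H,W)
        return w * flow_feat + (1 - w) * depth_feat

psf = PixelSelectiveFusion()
out = psf(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)
```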

Authors:Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong
Title: MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Abstract:
Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for sampling. So we present a faster variant, termed MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Codes and models are available at https://github.com/Tencent-Hunyuan/MixGRPO.
中文: MixGRPO采用滑动窗口混合采样策略,结合SDE和ODE仅优化部分去噪步骤,在图像生成的人类偏好对齐中显著提升了训练效率和性能。
English: MixGRPO introduces a mixed sampling strategy with a sliding window that combines SDE and ODE to optimize only specific denoising steps, significantly improving training efficiency and performance in human preference alignment for image generation.
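A toy sketch of the sliding-window mixed sampler described above: stochastic (SDE-style) updates are applied only inside a window of denoising steps and deterministic (ODE-style) Euler updates elsewhere, so the randomness that GRPO optimizes over stays confined to the window. The velocity field, noise scale, and step counts are placeholders, not the paper's flow-matching model.

```python
# Toy illustration of MixGRPO's sliding-window idea: SDE sampling only inside a
# window of denoising steps, deterministic ODE updates outside it.
import numpy as np

def velocity(x, t):
    # Placeholder drift; a real flow-matching model would predict this.
    return -x / max(t, 1e-3)

def mixed_sample(x, num_steps=20, window_start=8, window_size=4, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    dt = 1.0 / num_steps
    window = range(window_start, window_start + window_size)
    for i in range(num_steps):
        t = 1.0 - i * dt
        drift = velocity(x, t) * dt
        if i in window:
            # SDE step (Euler-Maruyama): randomness lives only here, so GRPO-style
            # optimization and reward credit assignment stay inside the window.
            x = x + drift + noise_scale * np.sqrt(dt) * rng.standard_normal(x.shape)
        else:
            # ODE step (plain Euler): deterministic, cheap, excluded from optimization.
            x = x + drift
    return x

print(mixed_sample(np.ones(4)).round(3))
```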

Authors:Zhaolong Wang, Tongfeng Sun, Mingzheng Du, Yachao Huang
Title: MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning
Abstract:
Vision-language pre-trained models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, and prompt learning has emerged as an efficient alternative to full fine-tuning. However, existing methods often struggle with generalization to novel classes, a phenomenon attributed to overfitting on seen classes and forgetting general knowledge. Furthermore, recent approaches that improve generalization often introduce complex architectures or heavy computational overhead. In this paper, we propose a Multiple Semantic-Guided Context Optimization (MSGCoOp) framework to enhance few-shot generalization while maintaining computational efficiency. Our approach leverages an ensemble of parallel learnable context vectors to capture diverse semantic aspects. To enrich these prompts, we introduce a semantic guidance mechanism that aligns them with comprehensive class descriptions automatically generated by a Large Language Model (LLM). Furthermore, a diversity regularization loss encourages the prompts to learn complementary and orthogonal features, preventing them from collapsing into redundant representations. Extensive experiments on 11 benchmark datasets show that MSGCoOp significantly improves performance on base-to-novel generalization, achieving an average harmonic mean improvement of 1.10\% over the strong KgCoOp baseline. Our method also demonstrates enhanced robustness in cross-domain generalization tasks. Our code is available at: \href{https://github.com/Rain-Bus/MSGCoOp}{https://github.com/Rain-Bus/MSGCoOp}.
Chinese: MSGCoOp框架通过采用大语言模型生成的类别描述引导的并行可学习上下文向量和多样性正则化,提升了视觉语言模型的少样本泛化能力,在基础到新类别及跨领域任务中表现优异,同时保持了计算效率。
English: The MSGCoOp framework enhances few-shot generalization in vision-language models by using parallel learnable context vectors guided by LLM-generated class descriptions and diversity regularization, achieving improved performance on base-to-novel and cross-domain tasks while maintaining computational efficiency.
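The diversity regularization can be illustrated with a small sketch that penalizes pairwise cosine similarity among the parallel context vectors so they stay near-orthogonal; the tensor shapes and the squared-cosine penalty are assumptions rather than MSGCoOp's exact loss.

```python
# Hedged sketch of a diversity regularizer for an ensemble of learnable context
# vectors: penalize pairwise cosine similarity so prompts stay near-orthogonal.
import torch
import torch.nn.functional as F

def diversity_loss(ctx: torch.Tensor) -> torch.Tensor:
    """ctx: (K, L, D) ensemble of K context prompts, each L tokens of dim D."""
    flat = F.normalize(ctx.flatten(1), dim=-1)        # (K, L*D), unit vectors
    sim = flat @ flat.t()                             # pairwise cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    return (off_diag ** 2).mean()                     # push off-diagonal terms to zero

ctx_vectors = torch.nn.Parameter(torch.randn(4, 16, 512) * 0.02)
print(diversity_loss(ctx_vectors).item())
```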

Authors:Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, Zitong Yu
Title: AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion
Abstract:
The detection of micro-expression Action Units (AUs) is a formidable challenge in affective computing, pivotal for decoding subtle, involuntary human emotions. While Large Language Models (LLMs) demonstrate profound reasoning abilities, their application to the fine-grained, low-intensity domain of micro-expression AU detection remains unexplored. This paper pioneers this direction by introducing \textbf{AU-LLM}, a novel framework that for the first time uses an LLM to detect AUs in micro-expression datasets characterized by subtle intensities and scarce data. To bridge the critical vision-language semantic gap, we introduce the \textbf{Enhanced Fusion Projector (EFP)}. The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token. This compact representation effectively empowers the LLM to perform nuanced reasoning over subtle facial muscle movements. Through extensive evaluations on the benchmark CASME II and SAMM datasets, including stringent Leave-One-Subject-Out (LOSO) and cross-domain protocols, AU-LLM establishes a new state-of-the-art, validating the significant potential and robustness of LLM-based reasoning for micro-expression analysis. The codes are available at https://github.com/ZS-liu-JLU/AU-LLMs.
中文摘要:本文首次提出AU-LLM框架,通过增强融合投影器整合视觉特征,成功将大语言模型应用于微表情动作单元检测,在基准数据集上实现了最先进的性能,验证了大语言模型在细微面部肌肉运动分析中的强大潜力。
English Summary: This paper introduces AU-LLM, a pioneering framework that leverages Large Language Models for micro-expression Action Unit detection, achieving state-of-the-art performance through enhanced visual feature fusion and demonstrating robust reasoning capabilities on subtle facial movements.

Authors:Aybora Koksal, A. Aydin Alatan
Title: Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards
Abstract:
Recent advances in large language and vision-language models have enabled strong reasoning capabilities, yet they remain impractical for specialized domains like remote sensing, where annotated data is scarce and expensive. We present the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery that eliminates the need for caption supervision--relying solely on lightweight, rule-based binary or IoU-based rewards. Adapting the "1-shot RLVR" paradigm from language models to vision-language models, we employ policy-gradient optimization with as few as one curated example to align model outputs for satellite reasoning tasks. Comprehensive experiments across multiple remote sensing benchmarks--including classification, visual question answering, and grounding--show that even a single example yields substantial improvements over the base model. Scaling to 128 examples matches or exceeds models trained on thousands of annotated samples. While the extreme one-shot setting can induce mild, task-specific overfitting, our approach consistently demonstrates robust generalization and efficiency across diverse tasks. Further, we find that prompt design and loss weighting significantly influence training stability and final accuracy. Our method enables cost-effective and data-efficient development of domain-specialist vision-language reasoning models, offering a pragmatic recipe for data-scarce fields: start from a compact VLM, curate a handful of reward-checkable cases, and train via RLVR.
Chinese Summary: 该研究提出了一种用于卫星图像的带可验证奖励的少样本强化学习框架(RLVR),在遥感任务中仅需极少标注数据即可实现强大性能,仅用128个样本就能媲美基于数千样本训练的模型。
English Summary: The study introduces a few-shot reinforcement learning framework with verifiable rewards (RLVR) for satellite imagery, which achieves strong performance in remote sensing tasks using minimal annotated data, even matching models trained on thousands of samples with just 128 examples.
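The rule-based verifiable rewards mentioned above reduce to very small functions; a minimal sketch of a binary exact-match reward (classification/VQA) and an IoU reward (grounding) follows, with answer parsing simplified relative to whatever the released pipeline does.

```python
# Minimal examples of rule-based verifiable rewards for RLVR: a binary exact-match
# reward for classification/VQA and an IoU reward for grounding. The paper's exact
# reward rules and answer parsing may differ.

def binary_reward(prediction: str, answer: str) -> float:
    """1.0 if the normalized prediction matches the reference answer, else 0.0."""
    return float(prediction.strip().lower() == answer.strip().lower())

def iou_reward(pred_box, gt_box) -> float:
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

print(binary_reward("Airport", "airport"))                        # 1.0
print(round(iou_reward((10, 10, 50, 50), (20, 20, 60, 60)), 3))   # 0.391
```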

Authors:Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
Title: MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Abstract:
In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE's "Any-to-Any" capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model's output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE.
中文摘要:提出的MAGE框架通过智能对齐网络和混合训练策略,解决了多模态学习中视觉编码后的空间与语义损失问题,在多项评测基准中表现优异。
English Summary: The proposed MAGE framework addresses spatial and semantic losses in multimodal learning by introducing an Intelligent Alignment Network and a combined training strategy, achieving superior performance on multiple benchmarks.

Authors:Qianxiong Xu, Lanyun Zhu, Chenxi Liu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao
Title: SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking
Abstract:
Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with additional modules, including: (1) Prototypical Memory Bank: We propose to quantify the feature-wise and position-wise correctness of each frame's tracking results, and select the best frames to condition subsequent frames. As the features of occluded and distracting objects are feature-wise and position-wise inaccurate, their scores would naturally be lower and thus can be filtered to intercept error propagation; (2) Positional Prompt Generator: To further reduce the impacts of distractors, we propose to generate positional mask prompts to provide explicit positional clues for the target, leading to more accurate tracking. Extensive experiments have been conducted on six benchmarks, showing the superiority of SAMITE. The code is available at https://github.com/Sam1224/SAMITE.
Chinese: SAMITE模型在SAM2视频基础模型上增加了原型记忆库来筛选准确帧和位置提示生成器以减少干扰,从而在多个基准测试中实现了卓越的视觉目标跟踪性能。
English: The SAMITE model enhances the SAM2 video foundation model for visual object tracking by introducing a Prototypical Memory Bank to filter out inaccurate frames and a Positional Prompt Generator to reduce distractions, achieving superior performance on multiple benchmarks.

Authors:Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, Shuxiang Song
Title: Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking
Abstract:
The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework named \textbf{SSTrack}, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables SSTrack to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses \textit{SOTA} self-supervised tracking methods, achieving an improvement of more than 25.3\%, 20.4\%, and 14.8\% in AUC (AO) score on the GOT10K, LaSOT, and TrackingNet datasets, respectively. Code: https://github.com/GXNU-ZhongLab/SSTrack.
中文摘要:自监督跟踪框架SSTrack通过解耦的时空一致性训练方法和实例对比损失,无需人工框标注即可学习通用跟踪表示,在多个基准数据集上显著超越了现有最优方法。
English Summary: The Self-Supervised Tracking framework (SSTrack) eliminates the need for manual box annotations by employing a decoupled spatio-temporal consistency training method and instance contrastive loss, achieving state-of-the-art performance across multiple tracking benchmarks.
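As one way to realize the instance contrastive loss described above, the sketch below implements a generic InfoNCE-style objective over two views of the same instances; SSTrack's exact formulation may differ.

```python
# Generic InfoNCE-style instance contrastive loss over two views of N instances;
# a sketch of the "instance-level correspondences" idea, not SSTrack's exact loss.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(view_a: torch.Tensor, view_b: torch.Tensor, tau: float = 0.1):
    """view_a, view_b: (N, D) embeddings of the same N instances under two views."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / tau                  # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Each instance in view A should match its own counterpart in view B, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

za, zb = torch.randn(8, 256), torch.randn(8, 256)
print(instance_contrastive_loss(za, zb).item())
```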

Authors:Jiong Yin, Liang Li, Jiehua Zhang, Yuhan Gao, Chenggang Yan, Xichun Sheng
Title: Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning
Abstract:
Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge is how to preserve old task knowledge while facilitating the learning of new tasks with previous experience. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design a task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning and enhance shared understanding between tasks. In the middle phase, we propose a task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities, balancing the model's ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce task-specific modality-independent prompts to further refine understanding by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of four tasks (AVE, AVVP, AVS and AVQA). Our code is available at https://github.com/ENJOY-Yin-jiong/PHP.
中文: 提出的渐进式稳态与可塑性(PHP)方法通过三个阶段的提示机制,在视听多任务增量学习中平衡知识保留与迁移,实现了最先进的性能。
English: The proposed Progressive Homeostatic and Plastic (PHP) method enables effective audio-visual multi-task incremental learning by balancing knowledge retention and transfer across tasks through three specialized prompt phases, achieving state-of-the-art performance.

Authors:Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, Xing Xie
Title: MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet can produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
中文: 本研究首次提出MoHoBench基准,系统评估多模态大语言模型在面对视觉不可答问题时的诚实性,发现多数模型无法恰当拒答且视觉信息显著影响诚实表现,进而提出了改进的对齐方法。
English: This study introduces MoHoBench, the first benchmark to systematically evaluate the honesty of Multimodal Large Language Models (MLLMs) when faced with unanswerable visual questions, revealing that most models fail to refuse appropriately and that visual information significantly impacts honesty, leading to proposed alignment methods for improvement.

Authors:Zhichuan Wang, Yang Zhou, Zhe Liu, Rui Yu, Song Bai, Yulong Wang, Xinwei He, Xiang Bai
Title: Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
Abstract:
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01\% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
中文摘要:DAC框架通过结合CLIP与多模态大语言模型,仅使用多视角图像学习泛化性强的三维表征,在开放集三维物体检索任务中大幅超越现有方法,在四个数据集上平均提升10.01% mAP。
English Summary: The DAC framework leverages CLIP and a multi-modal large language model to create generalized 3D representations using only multi-view images, achieving state-of-the-art performance in open-set 3D object retrieval by significantly outperforming previous methods across multiple datasets.

Authors:Han Wu, Chong Wang, Zhiming Cui
Title: Dual Cross-image Semantic Consistency with Self-aware Pseudo Labeling for Semi-supervised Medical Image Segmentation
Abstract:
Semi-supervised learning has proven highly effective in tackling the challenge of limited labeled training data in medical image segmentation. In general, current approaches, which rely on intra-image pixel-wise consistency training via pseudo-labeling, overlook the consistency at more comprehensive semantic levels (e.g., object region) and suffer from severe discrepancy of extracted features resulting from an imbalanced number of labeled and unlabeled data. To overcome these limitations, we present a new \underline{Du}al \underline{C}ross-\underline{i}mage \underline{S}emantic \underline{C}onsistency (DuCiSC) learning framework, for semi-supervised medical image segmentation. Concretely, beyond enforcing pixel-wise semantic consistency, DuCiSC proposes dual paradigms to encourage region-level semantic consistency across: 1) labeled and unlabeled images; and 2) labeled and fused images, by explicitly aligning their prototypes. Relying on the dual paradigms, DuCiSC can effectively establish consistent cross-image semantics via prototype representations, thereby addressing the feature discrepancy issue. Moreover, we devise a novel self-aware confidence estimation strategy to accurately select reliable pseudo labels, allowing for exploiting the training dynamics of unlabeled data. Our DuCiSC method is extensively validated on four datasets, including two popular binary benchmarks in segmenting the left atrium and pancreas, a multi-class Automatic Cardiac Diagnosis Challenge dataset, and a challenging scenario of segmenting the inferior alveolar nerve that features complicated anatomical structures, showing superior segmentation results over previous state-of-the-art approaches. Our code is publicly available at \href{https://github.com/ShanghaiTech-IMPACT/DuCiSC}{https://github.com/ShanghaiTech-IMPACT/DuCiSC}.
中文:DuCiSC框架通过引入跨图像区域级语义一致性和自感知置信度策略,解决了半监督医学图像分割中的特征差异问题,并在多个数据集上实现了优越的分割效果。
English: The DuCiSC framework enhances semi-supervised medical image segmentation by introducing dual cross-image semantic consistency at the region level and a self-aware confidence strategy to address feature discrepancies and improve pseudo-label reliability, achieving superior results across multiple datasets.

Authors:Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
Title: Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Abstract:
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluation and inference-time scaling in text-to-image generation.
中文: LLaVA-Reward是一种高效的奖励模型,利用预训练多模态大语言模型的隐藏状态和跳跃连接交叉注意力模块,从多角度自动评估文生图生成质量,在自动评估和推理扩展方面优于现有方法。
English: LLaVA-Reward is an efficient reward model that uses pretrained multimodal large language models to automatically evaluate text-to-image generations from multiple perspectives by leveraging hidden states and a Skip-connection Cross Attention module for improved visual-textual reasoning.
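A hedged sketch of a skip-connection cross-attention block in the spirit of SkipCA: later-layer hidden states attend to early-layer visual features and are added back through a residual path. Layer selection, dimensions, and normalization are illustrative assumptions, not the released module.

```python
# Hedged sketch of a SkipCA-style block: late hidden states attend to early-layer
# visual tokens, with a residual "skip" back into the decoder stream.
import torch
import torch.nn as nn

class SkipCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, early_visual: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) late hidden states; early_visual: (B, V, D) early visual tokens.
        attended, _ = self.attn(query=hidden, key=early_visual, value=early_visual)
        return self.norm(hidden + attended)

block = SkipCrossAttention(dim=512)
out = block(torch.randn(2, 32, 512), torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 32, 512])
```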

Authors:Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc
Title: Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy
Abstract:
Vision-based bird's-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5\% mAP and 59.2\% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.
Chinese Summary: 本研究提出了一种名为协同感知器(CoP)的多任务学习框架,通过整合空间占据信息来改进基于视觉的3D物体检测,在nuScenes基准测试中实现了最先进的性能。
English Summary: The study introduces a multi-task learning framework called Collaborative Perceiver (CoP) that enhances vision-based 3D object detection by integrating spatial occupancy information, achieving state-of-the-art performance on the nuScenes benchmark.

Authors:Amirmohammad Shamaei, Alexander Stebner, Salome Bosshart, Johanna Ospel, Gouri Ginde, Mariana Bento, Roberto Souza
Title: Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging
Abstract:
Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs, and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.
中文: 本研究提出了一种新颖的深度学习MRI重建框架,通过配准模型和Transformer网络整合先验扫描,相比现有方法显著提升了图像质量、缩短了重建时间,并改善了脑部分割任务的准确性。
English: This study introduces a novel deep-learning MRI reconstruction framework that integrates prior scans via a registration model and transformer network, significantly improving image quality, accelerating reconstruction time, and enhancing downstream brain segmentation accuracy compared to existing methods.

Authors:Feixiang Zhou, Zhuangzhi Gao, He Zhao, Jianyang Xie, Yanda Meng, Yitian Zhao, Gregory Y. H. Lip, Yalin Zheng
Title: GLCP: Global-to-Local Connectivity Preservation for Tubular Structure Segmentation
Abstract:
Accurate segmentation of tubular structures, such as vascular networks, plays a critical role in various medical domains. A remaining significant challenge in this task is structural fragmentation, which can adversely impact downstream applications. Existing methods primarily focus on designing various loss functions to constrain global topological structures. However, they often overlook local discontinuity regions, leading to suboptimal segmentation results. To overcome this limitation, we propose a novel Global-to-Local Connectivity Preservation (GLCP) framework that can simultaneously perceive global and local structural characteristics of tubular networks. Specifically, we propose an Interactive Multi-head Segmentation (IMS) module to jointly learn global segmentation, skeleton maps, and local discontinuity maps, respectively. This enables our model to explicitly target local discontinuity regions while maintaining global topological integrity. In addition, we design a lightweight Dual-Attention-based Refinement (DAR) module to further improve segmentation quality by refining the resulting segmentation maps. Extensive experiments on both 2D and 3D datasets demonstrate that our GLCP achieves superior accuracy and continuity in tubular structure segmentation compared to several state-of-the-art approaches. The source codes will be available at https://github.com/FeixiangZhou/GLCP.
中文: 本研究提出的全局-局部连通性保持(GLCP)框架通过交互式多头分割模块同步感知管状网络的全局拓扑特征与局部断裂区域,结合双注意力优化机制,在二维和三维数据集上实现了优于现有方法的血管分割精度与连续性。
English: The proposed Global-to-Local Connectivity Preservation (GLCP) framework addresses structural fragmentation in tubular structure segmentation by jointly learning global topological features and local discontinuity regions, achieving superior accuracy and continuity through interactive multi-head segmentation and dual-attention refinement.

Authors:Théo Sourget, David Restrepo, Céline Hudelot, Enzo Ferrante, Stergios Christodoulidis, Maria Vakalopoulou
Title: Fairness and Robustness of CLIP-Based Models for Chest X-rays
Abstract:
Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness in the different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and patient subgroups based on age, sex, and race. Additionally, we assess the robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable results for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, we do not see such patterns using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at https://github.com/TheoSourget/clip_cxr_fairness
中文: 近期研究评估了六种基于CLIP的模型在胸部X光分类中的公平性和鲁棒性,发现不同年龄组患者存在性能差异,且模型依赖胸管等伪相关性,同时揭示了嵌入技术在检测敏感属性方面的局限性。
English: Recent research evaluates the fairness and robustness of six CLIP-based models on chest X-ray classification, revealing performance disparities across age groups and reliance on spurious correlations like chest drains, while also examining embedding limitations in detecting sensitive attributes.

Authors:Christopher Indris, Raiyan Rahman, Goetz Bramesfeld, Guanghui Wang
Title: Tracking Moose using Aerial Object Detection
Abstract:
Aerial wildlife tracking is critical for conservation efforts and relies on detecting small objects on the ground below the aircraft. It presents technical challenges: crewed aircraft are expensive, risky and disruptive; autonomous drones have limited computational capacity for onboard AI systems. Since the objects of interest may appear only a few pixels wide, small object detection is an inherently challenging computer vision subfield compounded by computational efficiency needs. This paper applies a patching augmentation to datasets to study model performance under various settings. A comparative study of three common yet architecturally diverse object detectors is conducted using the data, varying the patching method's hyperparameters against detection accuracy. Each model achieved at least 93\% mAP@IoU=0.5 on at least one patching configuration. Statistical analyses provide an in-depth commentary on the effects of various factors. Analysis also shows that faster, simpler models are about as effective as models that require more computational power for this task and perform well given limited patch scales, encouraging UAV deployment. Datasets and models will be made available via https://github.com/chrisindris/Moose.
Chinese: 本研究证明,对数据集应用分块增强方法可实现高效的空中野生动物追踪小目标检测,且简化模型与计算密集型模型性能相当,有利于在自主无人机上部署。
English: This study demonstrates that applying patching augmentation to datasets enables efficient small object detection for aerial wildlife tracking, with simpler models performing comparably to computationally intensive ones, facilitating deployment on autonomous drones.
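The patching augmentation studied above amounts to tiling each large aerial frame into fixed-size, possibly overlapping crops so that small animals occupy more pixels per input; a minimal sketch follows, with patch size and stride standing in for the hyperparameters the paper varies.

```python
# Simple overlapping-patch augmentation for large aerial frames: tile the image
# into fixed-size crops. Patch size and stride are the studied hyperparameters.
import numpy as np

def make_patches(image: np.ndarray, patch: int = 640, stride: int = 512):
    """image: (H, W, C) array; returns a list of (y, x, patch_array) tuples."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            patches.append((y, x, image[y:y + patch, x:x + patch]))
    return patches

frame = np.zeros((2048, 2448, 3), dtype=np.uint8)
print(len(make_patches(frame)))  # number of overlapping crops per frame
```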

Authors:Donglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin
Title: ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions
Abstract:
Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present Chart$\text{M}^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. Chart$\text{M}^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, Chart$\text{M}^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct Chart$\text{M}^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.
中文: 本文提出了一种结合自然语言与视觉标记的多模态图表编辑新范式,建立了ChartM3基准进行评估,并证明在专业数据集上微调多模态大模型能显著提升其视觉编辑指令的解析能力。
English: This paper introduces a multimodal chart editing paradigm combining natural language and visual indicators, presents the ChartM3 benchmark for evaluation, and demonstrates that fine-tuning MLLMs on a specialized dataset significantly improves performance in interpreting visual editing instructions.

Authors:Zedong Wang, Siyuan Li, Dan Xu
Title: Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
Abstract:
Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.
中文: 现有多任务优化方法过度依赖优化器解决任务冲突,而忽视了共享表示空间的潜力;Rep-MTL通过表征层面的任务显著性量化任务交互,在缓解负迁移的同时显式促进跨任务互补性学习。
English: Current multi-task optimization methods overly focus on resolving task conflicts through optimizers but overlook the potential of shared representation spaces, leading to the development of Rep-MTL which leverages representation-level saliency to enhance complementary knowledge sharing and mitigate negative transfer.

Authors:Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowen Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu
Title: Reconstructing 4D Spatial Intelligence: A Survey
Abstract:
Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.
Chinese: 本综述提出了一个层次化的4D空间智能重建框架,将现有方法划分为从基础3D属性到物理约束的五个递进层级,同时探讨了当前研究空白和未来发展方向。
English: This survey introduces a hierarchical framework for 4D spatial intelligence reconstruction, organizing methods into five progressive levels from basic 3D attributes to physical constraints, while addressing current gaps and future directions in the field.

Authors:Licai Sun, Xingxun Jiang, Haoyu Chen, Yante Li, Zheng Lian, Biu Liu, Yuan Zong, Wenming Zheng, Jukka M. Leppänen, Guoying Zhao
Title: Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Abstract:
Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at https://github.com/sunlicai/EmoCapCLIP.
中文: 当前面部情绪识别系统受限于过度简化的标签,而本研究提出了包含丰富语义描述的大规模数据集EmoCap100K和采用对比学习的EmoCapCLIP框架,通过利用自然语言监督显著提升了情绪表征能力,在多项基准测试中表现优异。
English: Current facial emotion recognition systems are limited by oversimplified labels, but this research introduces EmoCap100K, a large-scale dataset with rich semantic captions, and EmoCapCLIP, a framework that uses contrastive learning to improve emotion representation, demonstrating superior performance across multiple benchmarks.

Authors:Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li
Title: Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM
Abstract:
Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" in visual inputs, thereby effectively extending text-based safety to the visual modality.
中文摘要:安全张量作为可训练的输入向量被引入,能将文本安全机制迁移到大型视觉语言模型的视觉处理中,有效增强其拒绝有害视觉输入的能力,同时保持良性任务性能。
English Summary: Security tensors are introduced as trainable input vectors that transfer textual safety mechanisms to visual processing in large visual-language models, effectively enhancing their ability to reject harmful visual inputs while preserving performance on benign tasks.
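A minimal sketch of the security-tensor idea, assuming the trainable vectors are soft tokens prepended to the frozen model's input embeddings and optimized alone on the curated safety data; the model, tokenizer, and loss wiring are placeholders for illustration.

```python
# Hedged sketch: a trainable "security tensor" of soft tokens prepended to input
# embeddings of a frozen model; only these vectors receive gradients.
import torch
import torch.nn as nn

class SecurityTensor(nn.Module):
    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        # Only these vectors are trained; the LVLM's own weights stay frozen.
        self.soft_tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prefix = self.soft_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

security = SecurityTensor(num_tokens=8, embed_dim=4096)
optimizer = torch.optim.AdamW(security.parameters(), lr=1e-3)  # model params excluded
embeds = torch.randn(2, 32, 4096)   # stand-in for the frozen model's input embeddings
print(security(embeds).shape)        # torch.Size([2, 40, 4096])
```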

Authors:Xinhan Di, Kristin Qi, Pengqian Yu
Title: JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
Abstract:
Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.
中文: 本研究推出了JWB-DH-V1数据集和评估框架,旨在解决全身动作与语音联合生成中的不足,并通过性能差异揭示了未来研究的关键方向。
English: The study introduces JWB-DH-V1, a comprehensive dataset and evaluation framework to address gaps in joint whole-body motion and speech generation, revealing performance disparities that highlight key research areas.

Authors:Xiao Fang, Minhyek Jeon, Zheyang Qin, Stanislav Panev, Celso de Melo, Shuowen Hu, Shayok Chakraborty, Fernando De la Torre
Title: Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
Abstract:
Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: https://humansensinglab.github.io/AGenDA

Authors:Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan
Title: ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Abstract:
Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.
中文摘要:本文提出ARC-Hunyuan-Video多模态模型,通过端到端处理视频的视觉、音频和文本信号,实现了对短视频的深度结构化理解,在真实场景应用中表现出色且具备高效推理能力。
English Summary: This paper introduces ARC-Hunyuan-Video, a 7B-parameter multimodal model that processes visual, audio, and textual inputs end-to-end to achieve comprehensive video understanding, demonstrating strong performance in real-world applications with efficient inference capabilities.

Authors:Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, Liujuan Cao
Title: RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation
Abstract:
Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: https://github.com/AHideoKuzeA/RIS-LAD/.
中文: 本文提出了首个面向低空无人机场景的细粒度参考图像分割基准RIS-LAD,并设计了语义感知自适应推理网络SAARN,通过分层路由语义特征有效解决了微小物体和密集场景带来的新挑战。
English: This paper introduces RIS-LAD, the first fine-grained benchmark for referring image segmentation in low-altitude drone scenarios, and proposes SAARN, a semantic-aware adaptive reasoning network that dynamically routes linguistic features to effectively address challenges like tiny objects and crowded scenes.

Authors:Ziling Wu, Armaghan Moemeni, Praminda Caleb-Solly
Title: Ensemble Foreground Management for Unsupervised Object Discovery
Abstract:
Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self-supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth poses two challenges for existing UOD methods: 1) determining whether a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to decide whether a discovered region is foreground, and conduct one or a fixed number of discovery iterations. However, the existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under- or over-segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust and well-grounded foreground prior based on min-cut and ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. In addition, we propose UnionSeg, a transformer distilled from UnionCut that outputs the foreground union more efficiently and accurately. Our experiments show that by combining with UnionCut or UnionSeg, previous state-of-the-art UOD methods witness an increase in the performance of single object discovery, saliency detection and self-supervised instance segmentation on various benchmarks. The code is available at https://github.com/YFaris/UnionCut.
中文摘要:本文提出UnionCut,一种基于最小割和集成方法的鲁棒前景先验,通过精确识别前景区域并优化发现迭代次数,有效解决无监督对象发现中的关键挑战,显著提升了多个基准测试的性能。
English Summary: This paper introduces UnionCut, a robust foreground prior using min-cut and ensemble methods to address challenges in unsupervised object discovery by accurately identifying foreground regions and optimizing discovery iterations, significantly enhancing performance across various benchmarks.

Authors:Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
Title: METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Abstract:
Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts with dynamically adjusting pruning ratios for specific task demands. To our best knowledge, this is the first successful attempt that achieves an efficient multi-encoder based vision language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLMs, METEOR reduces 76% visual tokens with only 0.3% performance drop in average. The code is available at https://github.com/YuchenLiu98/METEOR.
中文摘要:提出的METEOR框架通过多阶段剪枝策略,在编码、融合和解码环节逐步消除冗余视觉标记,实现了高效的多编码器视觉语言模型,在减少76%视觉标记的同时仅造成0.3%的性能损失。
English Summary: The proposed METEOR framework progressively prunes redundant visual tokens across encoding, fusion, and decoding stages to create an efficient multi-encoder vision-language model that reduces 76% of visual tokens with minimal performance loss.
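A simplified sketch of prompt-conditioned token pruning in the decoding stage: each visual token is scored by its best similarity to any prompt token and only the top fraction is kept. The scoring rule and fixed keep ratio are assumptions; METEOR additionally prunes during encoding and fusion and adapts the ratio per task.

```python
# Simplified prompt-conditioned visual token pruning: keep the top-k visual tokens
# most similar to the text prompt. Scoring rule and keep ratio are assumptions.
import torch
import torch.nn.functional as F

def prune_visual_tokens(visual: torch.Tensor, text: torch.Tensor, keep_ratio: float = 0.25):
    """visual: (B, N, D) visual tokens; text: (B, T, D) prompt tokens."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    # Relevance of each visual token = best similarity to any prompt token.
    relevance = (v @ t.transpose(1, 2)).max(dim=-1).values        # (B, N)
    k = max(1, int(visual.size(1) * keep_ratio))
    idx = relevance.topk(k, dim=1).indices                         # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, visual.size(-1))
    return torch.gather(visual, 1, idx)

vis, txt = torch.randn(2, 576, 1024), torch.randn(2, 20, 1024)
print(prune_visual_tokens(vis, txt).shape)  # torch.Size([2, 144, 1024])
```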

Authors:Chunshi Wang, Bin Zhao, Shuxue Ding
Title: SCANet: Split Coordinate Attention Network for Building Footprint Extraction
Abstract:
Building footprint extraction holds immense significance in remote sensing image analysis and has great value in urban planning, land use, environmental protection and disaster assessment. Despite the progress made by conventional and deep learning approaches in this field, they continue to encounter significant challenges. This paper introduces a novel plug-and-play attention module, Split Coordinate Attention (SCA), which captures spatially remote interactions by employing pooling kernels with two spatial ranges, strategically encoding each channel along the x and y planes, and separately performing a series of split operations for each feature group, thus enabling more efficient semantic feature extraction. Inserted into a 2D CNN to form SCANet, our approach outperforms recent SOTA methods on the public Wuhan University (WHU) Building Dataset and the Massachusetts Building Dataset in terms of various metrics. In particular, SCANet achieves the best IoU scores of 91.61% and 75.49% on the two datasets, respectively. Our code is available at https://github.com/AiEson/SCANet
Chinese: 本文提出了一种新颖的分裂坐标注意力(SCA)模块,通过有效捕捉空间交互和语义特征,在基准数据集上实现了最先进的建筑轮廓提取性能。
English: This paper introduces a novel Split Coordinate Attention (SCA) module that enhances building footprint extraction by efficiently capturing spatial interactions and semantic features, achieving state-of-the-art performance on benchmark datasets.
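Code sketch (illustrative): a minimal coordinate-attention-style block in PyTorch showing how pooling separately along the x and y axes can gate a feature map. The class name and layer sizes are assumptions; the paper's SCA additionally uses two pooling-kernel ranges and per-group split operations that this sketch omits.
```python
import torch
import torch.nn as nn

class SplitCoordAttentionSketch(nn.Module):
    """Coordinate-attention-style gating: pool along H and W separately,
    encode jointly, then re-weight the input feature map (hypothetical sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # keep H, collapse W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # keep W, collapse H
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                                     # (b, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)                 # (b, c, w, 1)
        y = self.act(self.conv1(torch.cat([x_h, x_w], dim=2)))   # shared encoding
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                          # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))      # (b, c, 1, w)
        return x * a_h * a_w                                     # axis-wise gating

if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(SplitCoordAttentionSketch(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```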

Authors:Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, Yu Qiao
Title: Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback
Abstract:
Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, ``Reasoning-Rendering-Visual-Feedback'' (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the ``Asymmetry of Verification'' principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL), thereby reducing reliance on image-text supervision. RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform complex reasoning, including self-correction through multi-turn interactions. This process is optimized end-to-end using the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing similarly sized open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization. Notably, the model outperforms the more advanced MLLM used to generate visual feedback during training. Code is available at https://github.com/L-O-I/RRVF.
Chinese: RRVF框架通过推理、渲染和视觉反馈的闭环迭代过程,使多模态大语言模型能够直接从原始图像中学习复杂视觉推理,减少了对人工标注图像-文本数据的依赖,并在图像到代码生成任务中展现出卓越性能。
English: The RRVF framework enables Multimodal Large Language Models to learn complex visual reasoning directly from raw images by leveraging an iterative process of reasoning, rendering, and visual feedback, reducing reliance on curated image-text supervision and demonstrating superior performance in image-to-code generation tasks.

Authors:Kangcheng Bin, Chen Chen, Ting Hu, Jiahao Qi, Ping Zhong
Title: ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions
Abstract:
Multimodal fusion has become a key enabler for UAV-based object detection, as each modality provides complementary cues for robust feature extraction. However, due to significant differences in resolution, field of view, and sensing characteristics across modalities, accurate registration is a prerequisite before fusion. Despite its importance, there is currently no publicly available benchmark specifically designed for multimodal registration in UAV-based aerial scenarios, which severely limits the development and evaluation of advanced registration methods under real-world conditions. To bridge this gap, we present ATR-UMMIM, the first benchmark dataset specifically tailored for multimodal image registration in UAV-based applications. This dataset includes 7,969 triplets of raw visible, infrared, and precisely registered visible images, covering diverse scenarios with flight altitudes from 80m to 300m, camera angles from 0° to 75°, and all-day, all-year temporal variations under rich weather and illumination conditions. To ensure high registration quality, we design a semi-automated annotation pipeline to introduce reliable pixel-level ground truth to each triplet. In addition, each triplet is annotated with six imaging condition attributes, enabling benchmarking of registration robustness under real-world deployment settings. To further support downstream tasks, we provide object-level annotations on all registered images, covering 11 object categories with 77,753 visible and 78,409 infrared bounding boxes. We believe ATR-UMMIM will serve as a foundational benchmark for advancing multimodal registration, fusion, and perception in real-world UAV scenarios. The dataset can be downloaded from https://github.com/supercpy/ATR-UMMIM
中文: ATR-UMMIM是首个专为无人机多模态图像配准设计的基准数据集,包含7,969组可见光与红外图像三元组,提供精确的像素级标注和多样化的真实场景条件,旨在推动配准与感知技术的发展。
English: The ATR-UMMIM dataset is the first benchmark specifically designed for multimodal image registration in UAV applications, providing 7,969 triplets of visible and infrared images with precise pixel-level annotations and diverse real-world conditions to advance registration and perception technologies.

Authors:Zhuoer Yin, Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Keisuke Fujii
Title: KASportsFormer: Kinematic Anatomy Enhanced Transformer for 3D Human Pose Estimation on Short Sports Scene Video
Abstract:
Recent transformer based approaches have demonstrated impressive performance in solving real-world 3D human pose estimation problems. Although these approaches achieve fruitful results on benchmark datasets, they tend to fall short in sports scenarios, where human movements are more complicated than daily-life actions and are hindered by motion blur, occlusions, and domain shifts. Moreover, because critical motions in a sports game often finish within moments (e.g., shooting), the ability to focus on momentary actions is becoming a crucial factor in sports analysis, yet current methods appear to struggle with instantaneous scenarios. To overcome these limitations, we introduce KASportsFormer, a novel transformer based 3D pose estimation framework for sports that incorporates a kinematic anatomy-informed feature representation and integration module. In this module, the inherent kinematic motion information is extracted with the Bone Extractor (BoneExt) and Limb Fuser (LimbFus) modules and encoded in a multimodal manner, improving the capability of comprehending sports poses in short videos. We evaluate our method through two representative sports scene datasets: SportsPose and WorldPose. Experimental results show that our proposed method achieves state-of-the-art results with MPJPE errors of 58.0mm and 34.3mm, respectively. Our code and models are available at: https://github.com/jw0r1n/KASportsFormer
中文: 近期基于Transformer的方法在常规三维人体姿态估计中表现出色,但在运动场景中因复杂动作和瞬时行为受限;为此提出的KASportsFormer通过融合运动学解剖特征,在体育数据集上取得了最优性能。
English: Recent transformer-based methods excel in general 3D human pose estimation but struggle with sports scenarios due to complex movements and momentary actions, prompting the development of KASportsFormer which integrates kinematic anatomy features to achieve state-of-the-art results on sports datasets.

Authors:Zeyu Huang, Wei Meng, Quan Liu, Kun Chen, Li Ma
Title: AR-LIF: Adaptive reset leaky integrate-and-fire neuron for spiking neural networks
Abstract:
Spiking neural networks offer low energy consumption due to their event-driven nature. Beyond binary spike outputs, their intrinsic floating-point dynamics merit greater attention. Neuronal threshold levels and reset modes critically determine spike count and timing. Hard resets cause information loss, while soft resets apply uniform treatment to all neurons. To address these issues, we design an adaptive reset neuron that establishes relationships between inputs, outputs, and reset, while integrating a simple yet effective threshold adjustment strategy. Experimental results demonstrate that our method achieves excellent performance while maintaining lower energy consumption. In particular, it attains state-of-the-art accuracy on Tiny-ImageNet and CIFAR10-DVS. Codes are available at https://github.com/2ephyrus/AR-LIF.
Chinese: 该自适应重置神经元通过建立输入-输出-重置关系并整合阈值调节策略,解决了脉冲神经网络中的信息丢失和统一处理问题,在Tiny-ImageNet等数据集上以低能耗实现了最优精度。
English: The proposed adaptive reset neuron addresses information loss and uniform treatment issues in spiking neural networks by establishing input-output-reset relationships and incorporating a threshold adjustment strategy, achieving state-of-the-art accuracy on datasets like Tiny-ImageNet with low energy consumption.
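Code sketch (illustrative): a toy leaky integrate-and-fire update showing how hard and soft resets differ and how a single blending coefficient can interpolate between them. The `adaptive` branch here is a hypothetical illustration of reset adaptivity, not the paper's AR-LIF rule, which ties the reset to inputs and outputs.
```python
import torch

def lif_step(v, x, threshold=1.0, decay=0.5, reset="adaptive", alpha=0.5):
    """One leaky integrate-and-fire step (forward pass only).

    reset="hard": spiking neurons jump back to 0, discarding residual potential.
    reset="soft": the threshold is subtracted uniformly from spiking neurons.
    reset="adaptive": hypothetical blend; alpha=0 behaves like a hard reset,
                      alpha=1 like a soft reset.
    """
    v = decay * v + x                      # leaky integration of the input current
    spike = (v >= threshold).float()       # binary spike output
    if reset == "hard":
        v = v * (1.0 - spike)
    elif reset == "soft":
        v = v - spike * threshold
    else:                                  # illustrative adaptive blend
        v = v - spike * (alpha * threshold + (1.0 - alpha) * v)
    return spike, v

if __name__ == "__main__":
    v = torch.zeros(3)
    inputs = torch.tensor([[0.6, 0.9, 1.2], [0.7, 0.3, 0.1], [0.9, 0.8, 0.4]])
    for t, x in enumerate(inputs):
        s, v = lif_step(v, x)
        print(f"t={t} spikes={s.tolist()} v={[round(u, 3) for u in v.tolist()]}")
```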

Authors:Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
Title: Regularizing Subspace Redundancy of Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: https://github.com/Lucenova/ReSoRA.
中文:ReSoRA提出了一种自适应正则化方法,通过显式建模低秩适配投影子空间中的冗余并施加去冗余约束,在保持零推理成本的同时显著提升了跨视觉语言检索与分类任务的特征适应效能。
English: ReSoRA introduces an adaptive regularization method that explicitly reduces redundancy in Low-Rank Adaptation's projection subspaces, enhancing feature adaptation efficiency without increasing inference costs across diverse vision-language and classification tasks.
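Code sketch (illustrative): one way to picture a subspace de-redundancy penalty is to split a LoRA down-projection along its rank dimension and penalize similarity between the feature statistics of the resulting groups. The function name, group summary, and cosine penalty below are assumptions; ReSoRA's constraint on feature distributions is more elaborate.
```python
import torch
import torch.nn.functional as F

def subspace_redundancy_penalty(lora_A: torch.Tensor, x: torch.Tensor, n_groups: int = 4):
    """lora_A: (rank, in_dim) LoRA down-projection; x: (batch, in_dim) layer inputs.

    Splits the rank dimension into groups (assumes rank is divisible by n_groups),
    summarizes each group's activations, and penalizes pairwise cosine similarity
    between the group summaries.
    """
    groups = lora_A.chunk(n_groups, dim=0)                 # equal rank-wise subspaces
    stats = [(x @ g.t()).mean(dim=0) for g in groups]      # mean activation per group
    stats = torch.stack([F.normalize(s, dim=0) for s in stats])
    sim = stats @ stats.t()                                # pairwise cosine similarity
    off_diag = sim - torch.diag(torch.diag(sim))           # ignore self-similarity
    return off_diag.abs().mean()

if __name__ == "__main__":
    A = torch.randn(16, 512, requires_grad=True)   # rank-16 LoRA down-projection
    x = torch.randn(32, 512)                       # a batch of layer inputs
    loss = subspace_redundancy_penalty(A, x)
    loss.backward()
    print(float(loss), A.grad.shape)
```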

Authors:Haowen Li, Zhenfeng Fan, Zhang Wen, Zhengzhou Zhu, Yunjin Li
Title: AIComposer: Any Style and Content Image Composition via Feature Integration
Abstract:
Image composition has advanced significantly with large-scale pre-trained T2I diffusion models. Despite progress in same-domain composition, cross-domain composition remains under-explored. The main challenges are the stochastic nature of diffusion models and the style gap between input images, leading to failures and artifacts. Additionally, heavy reliance on text prompts limits practical applications. This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. Our method is efficient and robust, preserving the diffusion prior, as it involves minor steps for backward inversion and forward denoising without training the diffuser. Our method also uses a simple multilayer perceptron network to integrate CLIP features from foreground and background, manipulating diffusion with a local cross-attention strategy. It effectively preserves foreground content while enabling stable stylization without a pre-stylization network. Finally, we create a benchmark dataset with diverse contents and styles for fair evaluation, addressing the lack of testing datasets for cross-domain image composition. Our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations, significantly improving the LPIPS score by 30.5% and the CSD metric by 18.1%. We believe our method will advance future research and applications. Code and benchmark at https://github.com/sherlhw/AIComposer.
中文摘要:本文提出了首个无需文本提示的跨域图像合成方法,通过多层感知器和局部交叉注意力高效融合前景与背景风格,在定性和定量评估中均显著优于现有技术,并创建了基准数据集以推动该领域研究。
English Summary: This paper introduces the first text-prompt-free cross-domain image composition method that efficiently integrates foreground and background styles using a multilayer perceptron and local cross-attention, achieving superior performance with significant metric improvements.

Authors:Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
Title: TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Abstract:
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV)-which measures changes in both the magnitude and direction of token representations-and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
Chinese: TransPrune是一种无需训练的令牌剪枝方法,通过结合令牌转换变化和指令引导注意力来评估重要性,逐步剪除冗余令牌,在保持多模态性能的同时将推理计算量减半。
English: TransPrune is a training-free token pruning method that enhances inference efficiency in Large Vision-Language Models by progressively removing tokens based on Token Transition Variation and Instruction-Guided Attention, achieving comparable performance with over 50% reduction in computational costs.
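Code sketch (illustrative): a minimal version of a token-transition score that combines the magnitude and direction change of token representations between two layers and keeps the highest-scoring tokens. The function names and the keep-ratio pruning rule are assumptions; the paper's exact TTV definition and its combination with instruction-guided attention may differ.
```python
import torch
import torch.nn.functional as F

def token_transition_variation(h_prev: torch.Tensor, h_curr: torch.Tensor) -> torch.Tensor:
    """h_prev, h_curr: (num_tokens, dim) hidden states from consecutive layers.
    Returns a (num_tokens,) score combining magnitude and direction change."""
    magnitude = (h_curr - h_prev).norm(dim=-1)                      # change in length
    direction = 1.0 - F.cosine_similarity(h_curr, h_prev, dim=-1)   # change in angle
    return magnitude * direction

def prune_tokens(h_prev, h_curr, keep_ratio: float = 0.5):
    scores = token_transition_variation(h_prev, h_curr)
    k = max(1, int(keep_ratio * h_curr.size(0)))
    keep_idx = scores.topk(k).indices.sort().values   # keep top-k, preserve token order
    return h_curr[keep_idx], keep_idx

if __name__ == "__main__":
    prev, curr = torch.randn(196, 768), torch.randn(196, 768)   # e.g., 14x14 visual tokens
    kept, idx = prune_tokens(prev, curr, keep_ratio=0.25)
    print(kept.shape, idx[:5].tolist())   # torch.Size([49, 768]) ...
```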

Authors:Zhixi Cai, Kartik Kuckreja, Shreya Ghosh, Akanksha Chuchra, Muhammad Haris Khan, Usman Tariq, Tom Gedeon, Abhinav Dhall
Title: AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations
Abstract:
The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies commonly found in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M comprising 2 million video clips with diversified manipulation strategies and audio-visual perturbations. This paper describes the data generation strategies along with benchmarking of AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset, and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.

Authors:Yanyin Guo, Runxuan An, Junwei Li, Zhiyuan Zhang
Title: LSFDNet: A Single-Stage Fusion and Detection Network for Ships Using SWIR and LWIR
Abstract:
Traditional ship detection methods primarily rely on single-modal approaches, such as visible or infrared images, which limit their application in complex scenarios involving varying lighting conditions and heavy fog. To address this issue, we explore the advantages of short-wave infrared (SWIR) and long-wave infrared (LWIR) in ship detection and propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks, achieving remarkable detection performance and generating visually impressive fused images. To further improve the saliency of objects in the fused images and improve the performance of the downstream detection task, we introduce the Multi-Level Cross-Fusion (MLCF) module. This module combines object-sensitive fused features from the detection task and aggregates features across multiple modalities, scales, and tasks to obtain more semantically rich fused features. Moreover, we utilize the position prior from the detection task in the Object Enhancement (OE) loss function, further increasing the retention of object semantics in the fused images. The detection task also utilizes preliminary fused features from the fusion task to complement SWIR and LWIR features, thereby enhancing detection performance. Additionally, we have established a Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to train effective SWIR and LWIR image fusion and detection networks, bridging a gap in this field. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets. The source code and dataset are available at https://github.com/Yanyin-Guo/LSFDNet
中文摘要:针对传统单模态船舶检测方法在复杂场景下的局限性,本研究提出LSFDNet单阶段融合检测算法,通过结合短波与长波红外成像、多级交叉融合模块和物体增强损失函数,在新建立的近岸船舶数据集上验证了其优越性能。
English Summary: To overcome the limitations of traditional single-modal ship detection methods in complex conditions, this study introduces LSFDNet, a novel single-stage fusion detection algorithm that integrates short-wave and long-wave infrared imaging with a multi-level cross-fusion module and object enhancement loss, achieving superior performance validated on a newly established dataset.

Authors:Hyung Kyu Kim, Hak Gu Kim
Title: Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation
Abstract:
Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. It highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: https://cau-irislab.github.io/interspeech25/
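Code sketch (illustrative): a simple way to weight a frame-wise reconstruction loss by how quickly the ground-truth motion changes, approximating a coarticulation-aware weighting. The weighting rule and the parameter `gamma` are assumptions; the paper's phonetic context-dependent viseme formulation is richer than this.
```python
import torch
import torch.nn.functional as F

def context_weighted_loss(pred: torch.Tensor, target: torch.Tensor, gamma: float = 1.0):
    """pred, target: (T, V) per-frame vertex offsets or blendshape coefficients.
    Frames whose ground-truth motion changes faster receive larger weights."""
    velocity = target[1:] - target[:-1]                  # frame-to-frame change
    w = 1.0 + gamma * velocity.norm(dim=-1)              # (T-1,) dynamic weight
    w = torch.cat([w[:1], w])                            # duplicate the first weight
    per_frame = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
    return (w * per_frame).mean()

if __name__ == "__main__":
    T, V = 60, 70 * 3                                    # 60 frames, 70 3-D landmarks
    target = torch.randn(T, V).cumsum(dim=0) * 0.01      # smooth synthetic motion
    pred = target + 0.05 * torch.randn(T, V)
    print(float(context_weighted_loss(pred, target)))
```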

Authors:Hyung Kyu Kim, Sangmin Lee, Hak Gu Kim
Title: MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
Abstract:
Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker's speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style only with audio input to maximize usability in applications. Our framework consists of two training stages: the first stage stores and retrieves general motion (i.e., Memorizing), and the second stage performs personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods.

Authors:Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Title: T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
Abstract:
Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to repeatedly refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
中文: T2I-Copilot是一种免训练的多智能体系统,通过多模态大语言模型协作自动优化提示词和选择模型,在显著提升生成质量与图文对齐度的同时大幅降低成本。
English: T2I-Copilot is a training-free multi-agent system that automates prompt engineering and model selection through collaboration between multimodal large language models, significantly improving generation quality and text-image alignment while reducing costs.
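Code sketch (illustrative): the interpret-generate-evaluate loop described above, written with placeholder callables standing in for the (M)LLM agents and the T2I backend. All names, the score threshold, and the refinement prompt are assumptions used only to show the control flow, not the paper's implementation.
```python
from typing import Callable

def t2i_copilot_loop(prompt: str,
                     interpret: Callable[[str], str],
                     generate: Callable[[str], object],
                     evaluate: Callable[[str, object], float],
                     threshold: float = 0.8,
                     max_rounds: int = 3):
    """Interpret the prompt, generate, score, and regenerate until good enough."""
    report = interpret(prompt)                 # resolve ambiguity, standardize the prompt
    best_img, best_score = None, -1.0
    for _ in range(max_rounds):
        img = generate(report)
        score = evaluate(report, img)          # aesthetic quality + text-image alignment
        if score > best_score:
            best_img, best_score = img, score
        if score >= threshold:
            break
        report = interpret(f"{report}\nPrevious attempt scored {score:.2f}; refine.")
    return best_img, best_score

if __name__ == "__main__":
    result = t2i_copilot_loop(
        "a red fox under cherry blossoms",
        interpret=lambda p: p.strip(),             # stand-in for the Input Interpreter
        generate=lambda r: f"<image for: {r}>",    # stand-in for the Generation Engine
        evaluate=lambda r, img: 0.85,              # stand-in for the Quality Evaluator
    )
    print(result)
```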

Authors:Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, Yoshitaka Ushiku
Title: AgroBench: Vision-Language Model Benchmark in Agriculture
Abstract:
Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLM models across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at https://dahlian00.github.io/AgroBenchPage/ .

Authors:Yili Li, Gang Xiong, Gaopeng Gou, Xiangyan Qu, Jiamin Zhuang, Zhen Li, Junzheng Shi
Title: T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval
Abstract:
Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.
中文: 本研究提出T2VParser方法,通过提取多视角语义表征并采用自适应分解标记,在保持预训练模型知识的同时实现视频与文本的精准局部对齐,有效解决跨模态检索中的信息不对等问题。
English: This study introduces T2VParser, which addresses partial misalignment in video-text retrieval by extracting multiview semantic representations and using Adaptive Decomposition Tokens for precise cross-modal alignment while preserving pretrained model knowledge.

Authors:Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta Ancuti, Cosmin Ancuti
Title: ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
Abstract:
Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features--including deep feature embeddings, segmentation information, geometric cues, and color information--to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.
中文: ModalFormer是首个用于低光照图像增强的大规模多模态框架,通过跨模态转换器和多模态特征融合机制整合九种辅助模态,将RGB数据与分割、几何等上下文特征有效结合,实现了最先进的增强性能。
English: ModalFormer is a pioneering multimodal framework for low-light image enhancement that integrates nine auxiliary modalities through a Cross-modal Transformer and specialized attention mechanisms, achieving state-of-the-art results by effectively combining RGB data with contextual features like segmentation and geometry.

Authors:Peng Liu, Bianca Güttner, Yutong Su, Chenyang Li, Jinjing Xu, Mingyang Liu, Zhe Min, Andrey Zhylka, Jasper Smit, Karin Olthof, Matteo Fusaglia, Rudi Apolle, Matthias Miederer, Laura Frohneberger, Carina Riediger, Jügen Weitz, Fiona Kolbinger, Stefanie Speidel, Micha Pfeiffer
Title: PIVOTS: Aligning unseen Structures using Preoperative to Intraoperative Volume-To-Surface Registration for Liver Navigation
Abstract:
Non-rigid registration is essential for Augmented Reality guided laparoscopic liver surgery by fusing preoperative information, such as tumor location and vascular structures, into the limited intraoperative view, thereby enhancing surgical navigation. A prerequisite is the accurate prediction of intraoperative liver deformation which remains highly challenging due to factors such as large deformation caused by pneumoperitoneum, respiration and tool interaction as well as noisy intraoperative data, and limited field of view due to occlusion and constrained camera movement. To address these challenges, we introduce PIVOTS, a Preoperative to Intraoperative VOlume-To-Surface registration neural network that directly takes point clouds as input for deformation prediction. The geometric feature extraction encoder allows multi-resolution feature extraction, and the decoder, comprising novel deformation aware cross attention modules, enables pre- and intraoperative information interaction and accurate multi-level displacement prediction. We train the neural network on synthetic data simulated from a biomechanical simulation pipeline and validate its performance on both synthetic and real datasets. Results demonstrate superior registration performance of our method compared to baseline methods, exhibiting strong robustness against high amounts of noise, large deformation, and various levels of intraoperative visibility. We publish the training and test sets as evaluation benchmarks and call for a fair comparison of liver registration methods with volume-to-surface data. Code and datasets are available here https://github.com/pengliu-nct/PIVOTS.
中文: PIVOTS是一种用于非刚性配准的神经网络,通过点云预测肝脏变形,在腹腔镜手术中对噪声和大变形表现出强大的鲁棒性。
English: PIVOTS is a neural network for non-rigid registration that predicts liver deformation using point clouds, achieving robust performance against noise and large deformations in laparoscopic surgery.

Authors:Chenjian Gao, Lihe Ding, Rui Han, Zhanpeng Huang, Zibin Wang, Tianfan Xue
Title: From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
Abstract:
Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing. Project Page: https://cjeen.github.io/BraceletPaper/

Authors:Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, Lei Zhang
Title: Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training
Abstract:
Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at https://github.com/Joyies/TVT.
Chinese: 提出的转移VAE训练(TVT)策略通过将8倍下采样VAE转换为4倍版本并优化网络架构,有效提升了真实图像超分辨率中精细结构的重建能力,同时降低了计算成本。
English: The proposed Transfer VAE Training (TVT) strategy enhances fine-structure preservation in real-world image super-resolution by converting an 8× downsampling VAE to a 4× version and optimizing network architectures to reduce computational costs.

Authors:Dingkun Liu, Zhu Chen, Jingwei Luo, Shijie Lian, Dongrui Wu
Title: MIRepNet: A Pipeline and Foundation Model for EEG-Based Motor Imagery Classification
Abstract:
Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. Recent EEG foundation models aim to learn generalized representations across diverse BCI paradigms. However, these approaches overlook fundamental paradigm-specific neurophysiological distinctions, limiting their generalization ability. Importantly, in practical BCI deployments, the specific paradigm such as motor imagery (MI) for stroke rehabilitation or assistive robotics, is generally determined prior to data acquisition. This paper proposes MIRepNet, the first EEG foundation model tailored for the MI paradigm. MIRepNet comprises a high-quality EEG preprocessing pipeline incorporating a neurophysiologically-informed channel template, adaptable to EEG headsets with arbitrary electrode configurations. Furthermore, we introduce a hybrid pretraining strategy that combines self-supervised masked token reconstruction and supervised MI classification, facilitating rapid adaptation and accurate decoding on novel downstream MI tasks with fewer than 30 trials per class. Extensive evaluations across five public MI datasets demonstrated that MIRepNet consistently achieved state-of-the-art performance, significantly outperforming both specialized and generalized EEG models. Our code will be available on GitHub\footnote{https://github.com/staraink/MIRepNet}.
Chinese: 本文提出了首个专为运动想象范式设计的EEG基础模型MIRepNet,它采用神经生理学启发的预处理流程和混合预训练策略,在多个数据集上以极少数据需求实现了最先进的性能表现。
English: This paper introduces MIRepNet, the first EEG foundation model specifically designed for motor imagery paradigms, featuring a neurophysiologically-informed preprocessing pipeline and hybrid pretraining that achieves state-of-the-art performance across multiple datasets with minimal data requirements.

Authors:Risa Shinoda, Nakamasa Inoue, Iro Laina, Christian Rupprecht, Hirokatsu Kataoka
Title: AnimalClue: Recognizing Animals by their Traces
Abstract:
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/

Authors:Ruizi Yang, Xiaolu Liu, Junbo Chen, Jianke Zhu
Title: MambaMap: Online Vectorized HD Map Construction using State Space Model
Abstract:
High-definition (HD) maps are essential for autonomous driving, as they provide precise road information for downstream tasks. Recent advances highlight the potential of temporal modeling in addressing challenges like occlusions and extended perception range. However, existing methods either fail to fully exploit temporal information or incur substantial computational overhead in handling extended sequences. To tackle these challenges, we propose MambaMap, a novel framework that efficiently fuses long-range temporal features in the state space to construct online vectorized HD maps. Specifically, MambaMap incorporates a memory bank to store and utilize information from historical frames, dynamically updating BEV features and instance queries to improve robustness against noise and occlusions. Moreover, we introduce a gating mechanism in the state space, selectively integrating dependencies of map elements with high computational efficiency. In addition, we design innovative multi-directional and spatial-temporal scanning strategies to enhance feature extraction at both BEV and instance levels. These strategies significantly boost the prediction accuracy of our approach while ensuring robust temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed MambaMap approach outperforms state-of-the-art methods across various splits and perception ranges. Source code will be available at https://github.com/ZiziAmy/MambaMap.
中文摘要:MambaMap是一种创新框架,通过内存库和门控机制高效融合长时序特征来构建稳健的在线矢量化高精地图,在精度和一致性方面均优于现有最优方法。
English Summary: MambaMap is a novel framework that efficiently fuses long-range temporal features using a memory bank and gating mechanism to construct robust online vectorized HD maps, outperforming state-of-the-art methods in accuracy and consistency.

Authors:Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, Kun Zhou
Title: Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models
Abstract:
The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs' comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/MECo-Page.

Authors:Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Title: When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Abstract:
Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
中文: 多模态大语言模型处理长上下文时面临计算瓶颈,催生了按模态和机制分类的令牌压缩方法,旨在提升效率并指引未来研究方向。
English: Multimodal large language models face computational challenges from long contexts, prompting token compression methods categorized by modality and mechanism to enhance efficiency and guide future research.

Authors:Li Jinfu, Song Hong, Xia Jianghan, Lin Yucong, Wang Ting, Shao Long, Fan Jingfan, Yang Jian
Title: MoCTEFuse: Illumination-Gated Mixture of Chiral Transformer Experts for Multi-Level Infrared and Visible Image Fusion
Abstract:
While illumination changes inevitably affect the quality of infrared and visible image fusion, many outstanding methods still ignore this factor and directly merge the information from source images, leading to modality bias in the fused results. To this end, we propose a dynamic multi-level image fusion network called MoCTEFuse, which applies an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively preserve texture details and object contrasts in balance. MoCTE consists of high- and low-illumination expert subnetworks, each built upon the Chiral Transformer Fusion Block (CTFB). Guided by the illumination gating signals, CTFB dynamically switches between the primary and auxiliary modalities as well as assigning them corresponding weights with its asymmetric cross-attention mechanism. Meanwhile, it is stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information. To facilitate robust training, we propose a competitive loss function that integrates illumination distributions with three levels of sub-loss terms. Extensive experiments conducted on the DroneVehicle, MSRS, TNO and RoadScene datasets show MoCTEFuse's superior fusion performance. Finally, it achieves the best detection mean Average Precision (mAP) of 70.93% on the MFNet dataset and 45.14% on the DroneVehicle dataset. The code and model are released at https://github.com/Bitlijinfu/MoCTEFuse.
中文摘要:提出的MoCTEFuse网络通过光照门控的专家变换器动态平衡红外与可见光图像融合,采用竞争性损失函数在多个数据集上实现了优越性能。
English Summary: The proposed MoCTEFuse network dynamically balances infrared and visible image fusion through illumination-gated transformer experts, achieving superior performance across multiple datasets with a competitive loss function.
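Code sketch (illustrative): a stripped-down illumination-gated mixture of two experts, where a scalar brightness statistic of the visible image blends the outputs of a high- and a low-illumination branch. The placeholder convolutional experts and the mean-brightness gate are assumptions; the paper's chiral transformer experts and asymmetric cross-attention are not modeled here.
```python
import torch
import torch.nn as nn

def make_expert(channels: int = 16) -> nn.Module:
    """Placeholder expert: a tiny conv net mapping a (visible, infrared) pair to one image."""
    return nn.Sequential(
        nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
        nn.Conv2d(channels, 1, 3, padding=1),
    )

class GatedTwoExpertFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.expert_high = make_expert()   # branch intended for well-lit scenes
        self.expert_low = make_expert()    # branch intended for dark scenes

    def forward(self, visible: torch.Tensor, infrared: torch.Tensor) -> torch.Tensor:
        x = torch.cat([visible, infrared], dim=1)          # (B, 2, H, W), single-channel pair
        gate = visible.mean(dim=(1, 2, 3), keepdim=True)   # brightness statistic in [0, 1]
        return gate * self.expert_high(x) + (1 - gate) * self.expert_low(x)

if __name__ == "__main__":
    vis = torch.rand(1, 1, 64, 64) * 0.2   # a dark visible frame
    ir = torch.rand(1, 1, 64, 64)
    print(GatedTwoExpertFusion()(vis, ir).shape)  # torch.Size([1, 1, 64, 64])
```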

Authors:Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji
Title: Towards Universal Modal Tracking with Online Dense Temporal Token Learning
Abstract:
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: \textbf{Video-level Sampling}. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. \textbf{Video-level Association}. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-stream manner. \textbf{Modality Scalable}. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via a one-shot training manner for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our {\modaltracker} achieves a new \textit{SOTA} performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
中文: 提出的通用视频级模态感知跟踪模型{\modaltracker}采用在线密集时间令牌学习和一次性训练方案,以同一架构处理多种跟踪任务,在各类基准测试中均实现了最优性能。
English: The proposed universal video-level modality-awareness tracking model, called {\modaltracker}, employs online dense temporal token learning and a one-shot training scheme to handle multiple tracking tasks with the same architecture, achieving state-of-the-art performance across various benchmarks.

Authors:Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi
Title: LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
Abstract:
Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on https://github.com/kong13661/LRR-Bench.
Chinese: 本研究引入了一个合成基准来评估视觉语言模型的空间理解能力,发现尽管人类在所有任务中表现出色,但现有模型明显落后,尤其在复杂空间推理方面。
English: This study introduces a synthetic benchmark to evaluate Vision-Language Models' spatial understanding, revealing that while humans excel across tasks, current VLMs significantly lag, especially in complex spatial reasoning.

Authors:Zeyu Xi, Haoying Sun, Yaofei Wu, Junchi Yan, Haoran Zhang, Lifang Wu, Liang Wang, Changwen Chen
Title: Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
Abstract:
Existing sports video captioning methods often focus on the action yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the player identities are sometimes incorrect because the extra information is independent of the video content. This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, by integrating the outputs of the above modules as the multimodal prompt for the large language model (LLM), it facilitates the generation of descriptions with player identities. To support this work, we construct a new benchmark called NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 major event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance. Code and dataset are publicly available at https://github.com/Zeyu1226-mt/LLM-IAVC.
Chinese: 本文提出了一种以球员为中心的多模态提示生成网络(LLM-IAVC),通过提取视频中的身份相关信息来生成准确的带身份描述,在新建的NBA-Identity数据集上验证了其先进性能。
English: This paper introduces a player-centric multimodal prompt generation network (LLM-IAVC) that extracts identity-related information from sports videos to generate accurate identity-aware captions, validated on the new NBA-Identity dataset with state-of-the-art results.

Authors:Yuhong Zhang, Liyao Wang, Han Wang, Danni Wu, Zuzeng Lin, Feng Wang, Li Song
Title: AnimeColor: Reference-based Animation Colorization with Diffusion Transformers
Abstract:
Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose \textbf{AnimeColor}, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at \href{https://github.com/IamCreateAI/AnimeColor}{https://github.com/IamCreateAI/AnimeColor}.
中文: AnimeColor框架采用扩散变换器,结合高级颜色提取器和低级颜色引导器,通过基于参考的指导和多阶段训练实现了卓越的动画上色效果,在色彩准确性和时间一致性方面表现出优越性能。
English: The proposed AnimeColor framework utilizes Diffusion Transformers with a High-level Color Extractor and Low-level Color Guider to achieve superior animation colorization through reference-based guidance and multi-stage training, demonstrating enhanced performance in color accuracy and temporal consistency.

Authors:Daulet Toibazar, Kesen Wang, Sherif Mohamed, Abdulaziz Al-Badawi, Abdulrahman Alfulayt, Pedro J. Moreno
Title: Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality
Abstract:
Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadening the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method exclusively utilizes the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets that underwent high-precision filtration using our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. \textbf{Availability and implementation:} Our compact VLM filtration model, training data, utility scripts, and Supplementary data (Appendices) are freely available at https://github.com/daulettoibazar/Compact_VLM_Filter.
中文: 本研究提出了一种轻量级数据过滤框架,利用紧凑型视觉语言模型过滤网络噪声数据以提升训练质量,在降低计算开销的同时,其过滤后的数据集性能可媲美甚至优于大规模采集数据。
English: This study introduces a lightweight data filtration framework using a compact vision-language model to enhance training data quality by filtering noisy web data, achieving performance comparable to or better than larger datasets while reducing computational overhead.
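Code sketch (illustrative): the score-and-filter curation loop implied by the abstract, with a placeholder judge standing in for the fine-tuned compact VLM. The threshold and the toy scoring rule are assumptions; only the filtering control flow is meant to carry over.
```python
from typing import Callable, Iterable, List, Tuple

def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 score_pair: Callable[[str, str], float],
                 threshold: float = 0.7) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Keep image-caption pairs whose judge score reaches the threshold."""
    kept, dropped = [], []
    for image_path, caption in pairs:
        target = kept if score_pair(image_path, caption) >= threshold else dropped
        target.append((image_path, caption))
    return kept, dropped

if __name__ == "__main__":
    data = [("img_001.jpg", "a dog playing fetch in a park"),
            ("img_002.jpg", "asdkjh 123 buy now!!!")]
    # Toy judge: captions made of letters and spaces score high (placeholder logic only).
    toy_judge = lambda img, cap: 0.9 if cap.replace(" ", "").isalpha() else 0.2
    kept, dropped = filter_pairs(data, toy_judge)
    print("kept:", kept)
    print("dropped:", dropped)
```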

Authors:Jingxi Liao, Shijie Hao, Richang Hong, Meng Wang
Title: GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement
Abstract:
Low-light image enhancement (LLIE) aims to improve the visual quality of images captured under poor lighting conditions. In supervised LLIE research, there exists a significant yet often overlooked inconsistency between the overall brightness of an enhanced image and its ground truth counterpart, referred to as brightness mismatch in this study. Brightness mismatch negatively impacts supervised LLIE models by misleading model training. However, this issue is largely neglected in current research. In this context, we propose the GT-mean loss, a simple yet effective loss function directly modeling the mean values of images from a probabilistic perspective. The GT-mean loss is flexible, as it extends existing supervised LLIE loss functions into the GT-mean form with minimal additional computational costs. Extensive experiments demonstrate that the incorporation of the GT-mean loss results in consistent performance improvements across various methods and datasets.
中文: 本研究针对弱光图像增强中常被忽视的亮度失配问题,提出了一种灵活的GT-mean损失函数,能以最小计算成本持续提升各种方法和数据集的性能表现。
English: This study addresses the overlooked issue of brightness mismatch in supervised low-light image enhancement (LLIE) by proposing a flexible GT-mean loss function that consistently improves performance across methods and datasets with minimal computational overhead.
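
The abstract does not spell out the exact GT-mean formulation, so the sketch below illustrates one plausible reading: wrap a base L1 loss with a mean-aligned term, so that a brightness-mismatched but otherwise faithful prediction is not penalized as heavily. The blending weight `alpha` and the rescaling rule are assumptions.

```python
import torch
import torch.nn.functional as F

def gt_mean_l1_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical GT-mean extension of an L1 loss for low-light enhancement.

    The enhanced image is rescaled so its global mean matches the ground
    truth before the base loss is applied, removing the brightness-mismatch
    component; the result is blended with the plain L1 term via `alpha`
    (a made-up weight).
    """
    base = F.l1_loss(pred, gt)
    eps = 1e-6
    # Align overall brightness: scale pred so its per-image mean equals the GT mean.
    scale = gt.mean(dim=(1, 2, 3), keepdim=True) / (pred.mean(dim=(1, 2, 3), keepdim=True) + eps)
    aligned = F.l1_loss(pred * scale, gt)
    return alpha * base + (1.0 - alpha) * aligned

if __name__ == "__main__":
    pred = torch.rand(2, 3, 64, 64)
    gt = torch.rand(2, 3, 64, 64)
    print(gt_mean_l1_loss(pred, gt).item())
```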

Authors:Shizuka Akahori, Shotaro Teruya, Pragyan Shrestha, Yuichi Yoshii, Satoshi Iizuka, Akira Ikumi, Hiromitsu Tsuge, Itaru Kitahara
Title: Detection of Medial Epicondyle Avulsion in Elbow Ultrasound Images via Bone Structure Reconstruction
Abstract:
This study proposes a reconstruction-based framework for detecting medial epicondyle avulsion in elbow ultrasound images, trained exclusively on normal cases. Medial epicondyle avulsion, commonly observed in baseball players, involves bone detachment and deformity, often appearing as discontinuities in bone contour. Therefore, learning the structure and continuity of normal bone is essential for detecting such abnormalities. To achieve this, we propose a masked autoencoder-based, structure-aware reconstruction framework that learns the continuity of normal bone structures. Even in the presence of avulsion, the model attempts to reconstruct the normal structure, resulting in large reconstruction errors at the avulsion site. For evaluation, we constructed a novel dataset comprising normal and avulsion ultrasound images from 16 baseball players, with pixel-level annotations under orthopedic supervision. Our method outperformed existing approaches, achieving a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967. The dataset is publicly available at: https://github.com/Akahori000/Ultrasound-Medial-Epicondyle-Avulsion-Dataset.
中文: 本研究提出了一种基于掩码自编码器的重建框架,通过仅学习正常骨骼结构来检测肘部超声图像中的肱骨内上髁撕脱,在新公开数据集上取得了超过0.96的AUC值。
English: This study introduces a reconstruction-based method using masked autoencoders to detect medial epicondyle avulsion in elbow ultrasound by learning normal bone structures, achieving high accuracy with AUC scores over 0.96 on a new public dataset.
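
Because the autoencoder is trained only on normal elbows, detection at test time amounts to reading off where reconstruction fails. A minimal sketch of turning a reconstruction into pixel- and image-level anomaly scores (the max-based image score is an illustrative choice, not necessarily the paper's rule):

```python
import numpy as np

def anomaly_scores(image: np.ndarray, reconstruction: np.ndarray):
    """Pixel- and image-level anomaly scores from a reconstruction.

    A model trained only on normal bone keeps reconstructing a continuous
    contour, so the squared reconstruction error is large where the medial
    epicondyle is avulsed.  Max aggregation for the image-level score is a
    simplifying assumption.
    """
    error_map = (image.astype(np.float32) - reconstruction.astype(np.float32)) ** 2
    pixel_scores = error_map.mean(axis=-1) if error_map.ndim == 3 else error_map
    image_score = float(pixel_scores.max())
    return pixel_scores, image_score

if __name__ == "__main__":
    img = np.random.rand(256, 256)
    rec = np.clip(img + 0.2 * (np.random.rand(256, 256) > 0.99), 0, 1)  # simulate a small local defect
    _, score = anomaly_scores(img, rec)
    print(f"image-level anomaly score: {score:.3f}")
```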

Authors:Haoyue Li, Di Wu
Title: Hybrid-Domain Synergistic Transformer for Hyperspectral Image Denoising
Abstract:
Hyperspectral image denoising faces the challenge of multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods mostly focus on RGB images and struggle to effectively handle the unique spatial-spectral characteristics and complex noise distributions of hyperspectral images (HSI). This paper proposes an HSI denoising framework, Hybrid-Domain Synergistic Transformer Network (HDST), based on frequency domain enhancement and multiscale modeling, achieving three-dimensional collaborative processing of spatial, frequency and channel domains. The method innovatively integrates three key mechanisms: (1) introducing an FFT preprocessing module with multi-band convolution to extract cross-band correlations and decouple spectral noise components; (2) designing a dynamic cross-domain attention module that adaptively fuses spatial domain texture features and frequency domain noise priors through a learnable gating mechanism; (3) building a hierarchical architecture where shallow layers capture global noise statistics using multiscale atrous convolution, and deep layers achieve detail recovery through frequency domain postprocessing. Experiments on both real and synthetic datasets demonstrate that HDST significantly improves denoising performance while maintaining computational efficiency, validating the effectiveness of the proposed method. This research provides new insights and a universal framework for addressing complex noise coupling issues in HSI and other high-dimensional visual data. The code is available at https://github.com/lhy-cn/HDST-HSIDenoise.
中文摘要:本文提出HDST混合域协同Transformer网络,通过频域增强与多尺度建模实现空间、频率和通道的三维协同处理,在真实与合成数据集上均显著提升了高光谱图像去噪性能。
English Summary: This paper introduces HDST, a hybrid-domain transformer network that effectively denoises hyperspectral images by synergistically processing spatial, frequency, and channel domains through frequency enhancement and multiscale modeling, demonstrating superior performance on both real and synthetic datasets.
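
As a rough illustration of the first mechanism (FFT preprocessing with multi-band convolution), the sketch below transforms each band with a 2D FFT, concatenates its log-amplitude with the raw band, and mixes bands with a grouped convolution. Layer sizes, the grouping scheme, and the use of log-amplitude are assumptions, not the HDST configuration.

```python
import torch
import torch.nn as nn

class FFTBandPreprocess(nn.Module):
    """Minimal sketch of an FFT preprocessing step for HSI denoising."""

    def __init__(self, bands: int, hidden: int = 32):
        super().__init__()
        # Grouped ("multi-band") convolution mixing raw and spectral-amplitude channels.
        self.mix = nn.Conv2d(2 * bands, hidden, kernel_size=3, padding=1, groups=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, bands, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        amp = torch.log1p(spec.abs())
        # Bring the half-spectrum back to the spatial resolution so it can be concatenated.
        amp = torch.nn.functional.interpolate(amp, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.mix(torch.cat([x, amp], dim=1))

if __name__ == "__main__":
    hsi = torch.randn(1, 31, 64, 64)  # a 31-band hyperspectral patch
    print(FFTBandPreprocess(bands=31)(hsi).shape)
```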

Authors:Shibang Liu, Xuemei Xie, Guangming Shi
Title: KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation
Abstract:
Recent methods using diffusion models have made significant progress in Human Image Generation (HIG) with various control signals such as pose priors. In HIG, both accurate human poses and coherent visual quality are crucial for image generation. However, most existing methods mainly focus on pose accuracy while neglecting overall image quality, often improving pose alignment at the cost of image quality. To address this, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB), implemented as a visual codebook, provides coarse, global guidance based on input text-related visual features, improving pose accuracy while maintaining image quality, whereas the Dynamic pose Mask (DM) offers fine-grained local control to further refine pose precision. By injecting KB and DM at different stages of the diffusion process, our framework enhances pose accuracy through both global and local control without compromising image quality. Experiments demonstrate the effectiveness of KB-DMGen, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The project page and code are available at https://lushbng.github.io/KBDMGen.

Authors:Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng
Title: Region-based Cluster Discrimination for Visual Representation Learning
Abstract:
Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
中文: RICE是一种新颖方法,通过构建大规模区域数据集并采用统一的聚类判别损失,提升了区域级视觉和OCR能力,在分割和检测任务中持续优于先前方法。
English: RICE is a novel method that enhances region-level vision and OCR capabilities by constructing a large-scale region dataset and employing a unified cluster discrimination loss, consistently outperforming previous methods in segmentation and detection tasks.

Authors:Lehan Wang, Hualiang Wang, Chubin Ou, Lushi Chen, Yunyi Liang, Xiaomeng Li
Title: VAMPIRE: Uncovering Vessel Directional and Morphological Information from OCTA Images for Cardiovascular Disease Risk Factor Prediction
Abstract:
Cardiovascular disease (CVD) remains the leading cause of death worldwide, requiring urgent development of effective risk assessment methods for timely intervention. While current research has introduced non-invasive and efficient approaches to predict CVD risk from retinal imaging with deep learning models, the commonly used fundus photographs and Optical Coherence Tomography (OCT) fail to capture detailed vascular features critical for CVD assessment compared with OCT angiography (OCTA) images. Moreover, existing methods typically classify CVD risk only as high or low, without providing a deeper analysis on CVD-related blood factor conditions, thus limiting prediction accuracy and clinical utility. As a result, we propose a novel multi-purpose paradigm of CVD risk assessment that jointly performs CVD risk and CVD-related condition prediction, aligning with clinical experiences. Based on this core idea, we introduce OCTA-CVD, the first OCTA dataset for CVD risk assessment, and a Vessel-Aware Mamba-based Prediction model with Informative Enhancement (VAMPIRE) based on OCTA enface images. Our proposed model aims to extract crucial vascular characteristics through two key components: (1) a Mamba-Based Directional (MBD) Module that captures fine-grained vascular trajectory features and (2) an Information-Enhanced Morphological (IEM) Module that incorporates comprehensive vessel morphology knowledge. Experimental results demonstrate that our method can surpass standard classification backbones, OCTA-based detection methods, and ophthalmologic foundation models. Our codes and the collected OCTA-CVD dataset are available at https://github.com/xmed-lab/VAMPIRE.
中文摘要:本研究提出了一种基于OCTA成像的新型心血管疾病风险评估多用途范式,通过VAMPIRE模型提取关键血管特征,在预测精度上超越了现有方法。
English Summary: This study introduces a novel multi-purpose paradigm for cardiovascular disease (CVD) risk assessment using OCTA imaging, featuring the VAMPIRE model that extracts detailed vascular features through specialized modules to surpass existing methods in prediction accuracy.

Authors:Hao-Yu Hou, Chun-Yi Lee, Motoharu Sonogashira, Yasutomo Kawanishi
Title: FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
Abstract:
The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at https://github.com/Howardkhh/FROSS.
中文摘要:FROSS提出了一种超实时生成3D语义场景图的新方法,通过将2D场景图转换为3D高斯表示来避免密集点云处理,在速度和性能上均优于现有方法。
English Summary: FROSS introduces a faster-than-real-time method for generating 3D semantic scene graphs by converting 2D graphs into 3D Gaussian representations, eliminating intensive point cloud processing and outperforming existing approaches in speed and performance.
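
The core lifting step can be pictured as follows: back-project the depth pixels inside a 2D detection with the pinhole model and summarize them by a mean and covariance, i.e., a 3D Gaussian node. The sketch below assumes a simple box format and no outlier handling, so it is only an illustration of the idea.

```python
import numpy as np

def lift_box_to_gaussian(depth: np.ndarray, box: tuple, fx: float, fy: float, cx: float, cy: float):
    """Lift one 2D detection to a 3D Gaussian from an RGB-D frame (sketch).

    Pixels inside the box are back-projected with the pinhole model and
    summarized by their mean and covariance, giving a 3D Gaussian object
    node without dense point cloud processing.
    box: (u_min, v_min, u_max, v_max) in pixel coordinates.
    """
    u0, v0, u1, v1 = box
    vs, us = np.meshgrid(np.arange(v0, v1), np.arange(u0, u1), indexing="ij")
    z = depth[v0:v1, u0:u1]
    valid = z > 0
    x = (us[valid] - cx) * z[valid] / fx
    y = (vs[valid] - cy) * z[valid] / fy
    pts = np.stack([x, y, z[valid]], axis=1)             # (M, 3) camera-frame points
    return pts.mean(axis=0), np.cov(pts, rowvar=False)   # Gaussian mean and covariance

if __name__ == "__main__":
    depth = np.full((480, 640), 2.0, dtype=np.float32)
    mu, cov = lift_box_to_gaussian(depth, (300, 200, 340, 260), fx=525, fy=525, cx=320, cy=240)
    print(mu.shape, cov.shape)   # (3,), (3, 3)
```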

Authors:Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Title: Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
Abstract:
Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o's pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.
中文: 本文提出Text2Vis基准,用于全面评估文本到可视化模型在多种图表类型和数据查询中的表现,揭示了性能差距并提出智能体框架,显著提升GPT-4o性能并实现自动化评估。
English: This paper introduces Text2Vis, a comprehensive benchmark for evaluating text-to-visualization models across diverse chart types and data queries, revealing performance gaps and proposing an agentic framework that significantly improves GPT-4o's performance while enabling automated evaluation.

Authors:Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti
Title: Predicting Brain Responses To Natural Movies With Multimodal LLMs
Abstract:
We present MedARC's team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson's correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
中文: MedARC团队在Algonauts 2025挑战赛中通过整合多模态预训练模型特征,采用共享组头与个体残差头的轻量编码器架构,结合大规模超参数调优与集成学习,在测试集上获得0.2085平均皮尔逊相关系数,最终位列第四。
English: MedARC's fourth-place solution for the Algonauts 2025 challenge combined multimodal features from state-of-the-art models using a lightweight encoder with shared and subject-specific components, achieving strong generalization through extensive hyperparameter tuning and ensemble methods.
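
The encoder architecture described above (a shared group head plus subject-specific residual heads over projected stimulus features) is small enough to sketch directly. Dimensions below are placeholders; the actual pipeline also includes feature projection, temporal alignment, and per-parcel ensembling.

```python
import torch
import torch.nn as nn

class SharedPlusSubjectEncoder(nn.Module):
    """Sketch of a lightweight fMRI encoding head.

    Stimulus features (already projected and temporally aligned) are mapped
    to cortical parcels by a shared group head, and each subject adds a
    small residual head on top.
    """
    def __init__(self, feat_dim: int, n_parcels: int, n_subjects: int):
        super().__init__()
        self.group_head = nn.Linear(feat_dim, n_parcels)
        self.subject_heads = nn.ModuleList([nn.Linear(feat_dim, n_parcels) for _ in range(n_subjects)])

    def forward(self, feats: torch.Tensor, subject: int) -> torch.Tensor:
        return self.group_head(feats) + self.subject_heads[subject](feats)

if __name__ == "__main__":
    enc = SharedPlusSubjectEncoder(feat_dim=512, n_parcels=1000, n_subjects=4)
    x = torch.randn(8, 512)          # 8 aligned time points of stimulus features
    print(enc(x, subject=2).shape)   # -> torch.Size([8, 1000])
```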

Authors:Chengyu Zheng, Jin Huang, Honghua Chen, Mingqiang Wei
Title: RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
Abstract:
Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at https://github.com/zhengcy-lambo/RARE.git.
中文: 本文提出一种零样本方法,通过将预训练模型的深度扩散特征与几何特征相结合来优化点云配准,无需训练数据即可显著提升配准精度和泛化能力。
English: This paper introduces a zero-shot method that refines point cloud registration by integrating depth diffusion features from pretrained models with geometric features, achieving enhanced accuracy and generalization without requiring training data.

Authors:Qingqing Fang, Wenxi Lv, Qinliang Su
Title: AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation
Abstract:
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.
中文摘要:本研究提出的AF-CLIP方法通过轻量级适配器和多尺度聚合机制增强CLIP的视觉表征能力,在零样本场景下能更精准地检测局部异常,在工业和医疗数据集上均展现出优异性能。
English Summary: The proposed AF-CLIP method enhances CLIP's visual representations through a lightweight adapter and multi-scale aggregation to better detect local anomalies in zero-shot scenarios, demonstrating strong performance across industrial and medical datasets.
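
A rough sketch of the multi-scale spatial aggregation idea: reshape CLIP patch tokens to their 2D grid and average-pool them with several window sizes before the adapter, so defects of different sizes remain visible. The scale set and mean fusion below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_scale_aggregate(patch_tokens: torch.Tensor, grid: int, scales=(1, 3, 5)) -> torch.Tensor:
    """Multi-scale spatial aggregation over patch tokens (sketch).

    Patch features are reshaped to their 2D grid and average-pooled with
    several window sizes, then fused by a simple mean.
    patch_tokens: (B, grid*grid, C)
    """
    b, n, c = patch_tokens.shape
    fmap = patch_tokens.transpose(1, 2).reshape(b, c, grid, grid)
    pooled = [F.avg_pool2d(fmap, kernel_size=k, stride=1, padding=k // 2) for k in scales]
    fused = torch.stack(pooled, dim=0).mean(dim=0)
    return fused.flatten(2).transpose(1, 2)      # back to (B, grid*grid, C)

if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 768)        # e.g., 196 patch positions from a ViT backbone
    print(multi_scale_aggregate(tokens, grid=14).shape)
```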

Authors:Qing Xu, Yanming Chen, Yue Li, Ziyu Liu, Zhenye Lou, Yixuan Zhang, Xiangjian He
Title: MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation
Abstract:
Medical image segmentation plays an important role in computer-aided diagnosis. Traditional convolution-based U-shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite the advantage, the real-world application of vision transformers is challenged by their non-linear self-attention mechanism, which requires huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long-range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a Hybrid CNN-Mamba framework for medical image segmentation. Our MambaVesselNet++ is comprised of a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In Hi-Encoder, we first devise the texture-aware layer to capture low-level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long-range dependencies with linear complexity. The BF-Decoder adopts skip connections to combine local and global information of the Hi-Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution-based, transformer-based, and Mamba-based state-of-the-art methods across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at https://github.com/CC0117/MambaVesselNet.
Chinese: MambaVesselNet++ 是一种混合CNN-Mamba框架,通过创新的编码器-解码器设计有效整合局部纹理特征与全局依赖关系,在线性计算复杂度下实现了医学图像分割的卓越性能。
English: MambaVesselNet++ is a hybrid CNN-Mamba framework that effectively integrates local texture features and global dependencies through its innovative encoder-decoder design, achieving superior performance in medical image segmentation with linear computational complexity.

Authors:Chang Liu, Yunfan Ye, Fan Zhang, Qingyang Zhou, Yuchuan Luo, Zhiping Cai
Title: HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly
Abstract:
Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.

Authors:Drandreb Earl O. Juanico, Rowel O. Atienza, Jeffrey Kenneth Go
Title: Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention
Abstract:
We propose Reverse Contrast Attention (RCA), a plug-in method that enhances object localization in vision-language transformers without retraining. RCA reweights final-layer attention by suppressing extremes and amplifying mid-level activations to let semantically relevant but subdued tokens guide predictions. We evaluate it on Open Vocabulary Referring Object Detection (OV-RefOD), introducing FitAP, a confidence-free average precision metric based on IoU and box area. RCA improves FitAP in 11 out of 15 open-source VLMs, with gains up to $+26.6\%$. Effectiveness aligns with attention sharpness and fusion timing; while late-fusion models benefit consistently, models like $\texttt{DeepSeek-VL2}$ also improve, pointing to capacity and disentanglement as key factors. RCA offers both interpretability and performance gains for multimodal transformers. Codes and dataset are available from https://github.com/earl-juanico/rca
Chinese Summary: 反向对比注意力(RCA)是一种无需重新训练的即插即用方法,通过重新加权注意力机制增强中层激活来提升视觉语言Transformer中的目标定位能力,在开放词汇参考目标检测任务中最高可实现26.6%的性能提升。
English Summary: Reverse Contrast Attention (RCA) is a plug-in method that improves object localization in vision-language transformers by reweighting attention to enhance mid-level activations, achieving performance gains up to 26.6% on Open Vocabulary Referring Object Detection without requiring retraining.
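
The abstract's description (suppress extremes, amplify mid-level activations in the final-layer attention) can be illustrated with a simple row-wise reweighting. The triangular multiplier below is one way to realize that behaviour and is an assumption, not necessarily the paper's exact formula.

```python
import torch

def reverse_contrast_reweight(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative reverse-contrast reweighting of attention rows.

    Each row gets a triangular multiplier that is largest for weights near
    the middle of the row's range and vanishes at its extremes, then the
    row is renormalized, so mid-level tokens gain relative mass.
    attn: (..., n_keys) rows of non-negative attention weights.
    """
    lo = attn.amin(dim=-1, keepdim=True)
    hi = attn.amax(dim=-1, keepdim=True)
    mid = 0.5 * (hi + lo)
    half_range = 0.5 * (hi - lo) + eps
    mult = (1.0 - (attn - mid).abs() / half_range).clamp(min=0.0)
    reweighted = attn * mult
    return reweighted / (reweighted.sum(dim=-1, keepdim=True) + eps)

if __name__ == "__main__":
    attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)   # (batch, heads, queries, keys)
    rows = reverse_contrast_reweight(attn).sum(dim=-1)
    print(rows.min().item(), rows.max().item())               # rows still sum to ~1
```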

Authors:X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, K. Huang
Title: ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
Abstract:
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.
中文: ATCTrack是一种新型视觉语言跟踪器,通过全面的目标-上下文特征建模,动态对齐多模态线索与变化的目标状态,从而在复杂长期场景中实现鲁棒跟踪。
English: ATCTrack is a novel vision-language tracker that enhances robustness in complex long-term scenarios by dynamically aligning multimodal cues with evolving target states through comprehensive target-context feature modeling.

Authors:Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Title: Knowledge Regularized Negative Feature Tuning of Vision-Language Models for Out-of-Distribution Detection
Abstract:
Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from the ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44\% under an unexplored generalization setting with unseen ID categories. Codes can be found at \href{https://github.com/ZhuWenjie98/KRNFT}{https://github.com/ZhuWenjie98/KRNFT}.
中文: 提出的KR-NFT方法通过分布感知变换分离正负特征,并结合知识正则化优化策略,有效提升了视觉语言模型的分布外检测能力,同时增强了已知类别的分类准确性和对未知类别的泛化性能。
English: The proposed KR-NFT method enhances OOD detection in vision-language models by separating positive and negative features through distribution-aware transformations and a knowledge-regularized optimization strategy, improving both ID classification and generalization to unseen categories.

Authors:Yuze Wang, Yue Qi
Title: Taking Language Embedded 3D Gaussian Splatting into the Wild
Abstract:
Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been given to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for understanding the 3D structure of architectural components? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps (transient and appearance uncertainty) derived from the multi-appearance features to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.

Authors:Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Title: LAVA: Language Driven Scalable and Versatile Traffic Video Analytics
Abstract:
In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build \textsc{Lava}, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. \textsc{Lava} comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that \textsc{Lava} improves $F_1$-scores for selection queries by $\mathbf{14\%}$, reduces MPAE for aggregation queries by $\mathbf{0.39}$, and achieves top-$k$ precision of $\mathbf{86\%}$, while processing videos $ \mathbf{9.6\times} $ faster than the most accurate baseline. Our code and dataset are available at https://github.com/yuyanrui/LAVA.
中文: 本文提出Lava系统,通过自然语言查询实现大规模视频数据的灵活高效分析,在查询精度和处理速度上显著优于现有方法。
English: This paper introduces Lava, a language-driven video analytics system that enables flexible and efficient querying of large-scale video data using natural language, significantly outperforming existing methods in accuracy and processing speed.
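
The first component, bandit-based segment sampling, can be illustrated with a plain UCB1 loop: each video segment is an arm, probing a segment means running a cheap check on one of its frames, and probes concentrate on segments that keep returning relevant frames. The simulation below is a toy stand-in, not Lava's sampler.

```python
import math
import random

def ucb_segment_sampling(segment_hit_prob, budget: int, c: float = 1.4):
    """Toy UCB1 sampler over video segments.

    `segment_hit_prob[i]` is the (unknown to the sampler) chance that probing
    segment i returns a relevant frame; it is only used here to simulate
    probe outcomes.  Returns per-segment probe counts, which concentrate on
    promising segments.
    """
    n = len(segment_hit_prob)
    counts = [0] * n
    rewards = [0.0] * n
    for t in range(1, budget + 1):
        if t <= n:
            i = t - 1   # play each arm once before using UCB scores
        else:
            i = max(range(n), key=lambda k: rewards[k] / counts[k] + c * math.sqrt(math.log(t) / counts[k]))
        hit = random.random() < segment_hit_prob[i]   # simulated probe (e.g., a cheap detector on one frame)
        counts[i] += 1
        rewards[i] += float(hit)
    return counts

if __name__ == "__main__":
    random.seed(0)
    print(ucb_segment_sampling([0.05, 0.1, 0.6, 0.02], budget=200))
```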

Authors:Guiping Cao, Xiangyuan Lan, Wenjian Huang, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
Title: DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection
Abstract:
Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed queries is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, "query ambiguity" arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR's one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming the fixed queries into a flexible set. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing "query ambiguity" and "ROT" issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.
中文摘要:提出的DS-Det通过引入灵活单查询范式和解耦注意力学习,解决了Transformer检测器中查询歧义和循环对抗交互的问题,实现了更优的目标检测性能。
English Summary: The proposed DS-Det addresses limitations in transformer detectors by introducing a flexible single-query paradigm and disentangled attention learning to resolve query ambiguity and recurrent opposing interactions, achieving superior object detection performance.

Authors:Peng Cai, Qiang Li, Kaicheng Yang, Dong Guo, Jia Li, Nan Zhou, Xiang An, Ninghua Yang, Jiankang Deng
Title: ForCenNet: Foreground-Centric Network for Document Image Rectification
Abstract:
Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. Specifically, we initially propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art results on four real-world benchmarks: DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. The resources for further comparison are provided at https://github.com/caipeng328/ForCenNet.
中文: 本文提出的ForCenNet通过聚焦前景元素并采用曲率一致性损失,有效消除文档图像的几何变形,在多个基准测试中达到了最先进的性能水平。
English: This paper introduces ForCenNet, a foreground-centric network that effectively eliminates geometric distortions in document images by leveraging detailed foreground elements and a curvature consistency loss, achieving state-of-the-art performance on multiple benchmarks.

Authors:Kanglin Qu, Pan Gao, Qun Dai, Yuanhao Sun
Title: HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning
Abstract:
The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at https://github.com/Point-Cloud-Learning/HydraMamba.
中文:HydraMamba通过引入随机序列化策略和ConvBiS6层,有效提升了点云学习中长程依赖建模和局部几何特征捕获能力,在多项任务中取得了领先性能。
English: HydraMamba introduces a shuffle serialization strategy and a ConvBiS6 layer to enhance point cloud learning by improving long-range dependency modeling and local geometry capture, achieving state-of-the-art results across tasks.
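
The shuffle serialization idea can be shown in a few lines: an unordered point set has no natural causal order for an S6 layer, so the points are serialized with a random permutation and the inverse permutation is kept to restore the original order after the state space pass. The sketch below is a minimal stand-in for that step.

```python
import torch

def shuffle_serialize(points: torch.Tensor):
    """Serialize an unordered point set with a random permutation (sketch).

    Returns the permuted points and the inverse permutation that restores
    the original order after a causal state-space layer has processed them.
    points: (N, C) tensor of per-point features or coordinates.
    """
    n = points.shape[0]
    perm = torch.randperm(n)
    inverse = torch.empty_like(perm)
    inverse[perm] = torch.arange(n)
    return points[perm], inverse

if __name__ == "__main__":
    pts = torch.randn(1024, 3)
    serialized, inv = shuffle_serialize(pts)
    assert torch.allclose(serialized[inv], pts)   # the round trip restores the input order
```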

Authors:Seunghun Lee, Jiwan Seo, Minwoo Choi, Kiljoon Han, Jaehoon Jeong, Zane Durante, Ehsan Adeli, Sang Hyun Park, Sunghoon Im
Title: Latest Object Memory Management for Temporally Consistent Video Instance Segmentation
Abstract:
In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: https://seung-hun-lee.github.io/projects/LOMM/
Chinese: 本文提出最新对象记忆管理(LOMM)方法,通过最新对象记忆系统和解耦对象关联策略,显著提升视频实例分割中的长期目标跟踪性能,在YouTube-VIS 2022数据集上以54.0 AP分数创下最新纪录。
English: This paper introduces Latest Object Memory Management (LOMM), a novel method for video instance segmentation that enhances long-term object tracking through a Latest Object Memory system and Decoupled Object Association strategy, achieving state-of-the-art performance with a 54.0 AP score on YouTube-VIS 2022.
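
As a toy illustration of the memory idea, the class below keeps only the latest embedding per identity, matches detections to existing identities first (already-existing objects), and opens new identities for the rest (newly appearing objects). The cosine threshold and greedy matching are simplifying assumptions, not LOMM's actual association.

```python
import numpy as np

class LatestObjectMemory:
    """Toy 'latest object memory' for video instance tracking (sketch)."""

    def __init__(self, sim_threshold: float = 0.7):
        self.memory = {}          # identity id -> latest embedding
        self.next_id = 0
        self.sim_threshold = sim_threshold

    def update(self, embeddings: np.ndarray) -> list[int]:
        ids = []
        for emb in embeddings:
            emb = emb / (np.linalg.norm(emb) + 1e-8)
            best_id, best_sim = None, self.sim_threshold
            for obj_id, mem in self.memory.items():
                sim = float(emb @ mem)
                if sim > best_sim and obj_id not in ids:
                    best_id, best_sim = obj_id, sim
            if best_id is None:           # unmatched detection: newly appearing object
                best_id = self.next_id
                self.next_id += 1
            self.memory[best_id] = emb    # always keep the *latest* state
            ids.append(best_id)
        return ids

if __name__ == "__main__":
    mem = LatestObjectMemory()
    frame1 = np.eye(3)[:2]                 # two new objects
    frame2 = np.eye(3)[[1, 0, 2]]          # the same two reappear plus one new object
    print(mem.update(frame1), mem.update(frame2))   # [0, 1] [1, 0, 2]
```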

Authors:Liyang Wang, Shiqian Wu, Shun Fang, Qile Zhu, Jiaxin Wu, Sos Agaian
Title: Quaternion-Based Robust PCA for Efficient Moving Target Detection and Background Recovery in Color Videos
Abstract:
Moving target detection is a challenging computer vision task aimed at generating accurate segmentation maps in diverse in-the-wild color videos captured by static cameras. If backgrounds and targets can be simultaneously extracted and recombined, such synthetic data can significantly enrich annotated in-the-wild datasets and enhance the generalization ability of deep models. Quaternion-based RPCA (QRPCA) is a promising unsupervised paradigm for color image processing. However, in color video processing, Quaternion Singular Value Decomposition (QSVD) incurs high computational costs, and a rank-1 quaternion matrix fails to yield rank-1 color channels. In this paper, we reduce the computational complexity of QSVD to o(1) by utilizing a quaternion Riemannian manifold. Furthermore, we propose the universal QRPCA (uQRPCA) framework, which achieves a balance in simultaneously segmenting targets and recovering backgrounds from color videos. Moreover, we expand to uQRPCA+ by introducing the Color Rank-1 Batch (CR1B) method to further process and obtain the ideal low-rank background across color channels. Experiments demonstrate our uQRPCA+ achieves state-of-the-art (SOTA) performance on moving target detection and background recovery tasks compared to existing open-source methods. Our implementation is publicly available on GitHub at https://github.com/Ruchtech/uQRPCA
中文: 本文提出的uQRPCA+框架通过四元数黎曼流形降低计算复杂度,并采用颜色秩一批处理方法,在彩色视频运动目标检测和背景恢复任务中实现了最先进的性能。
English: This paper introduces the uQRPCA+ framework, which leverages quaternion Riemannian manifolds to reduce computational complexity and employs the Color Rank-1 Batch method to achieve state-of-the-art performance in moving target detection and background recovery from color videos.

Authors:Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
Title: Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks
Abstract:
Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner's proficiency, using haptic signaling as a primary form of communication. While today's AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text; it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.

Authors:Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, Stratis Gavves
Title: Object-centric Video Question Answering with Visual Grounding and Referring
Abstract:
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: https://qirui-chen.github.io/RGA3-release/.
中文: 本文提出一种支持多模态交互的视频大语言模型,通过新颖的时空覆盖模块增强以物体为中心的视频理解能力,在12个基准测试中均表现优异。
English: This paper introduces a VideoLLM that enhances object-centric video understanding by supporting multimodal interactions and a novel Spatial-Temporal Overlay Module, achieving superior performance across 12 benchmarks.

Authors:Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, Assaf Arbelle, Aude Oliva, Dan Gutfreund, Leonid Karlinsky, Rogerio Feris
Title: ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation
Abstract:
Chart-to-code reconstruction -- the task of recovering executable plotting scripts from chart images -- provides important insights into a model's ability to ground data visualizations in precise, machine-readable form. Yet many existing multimodal benchmarks focus primarily on answering questions about charts or summarizing them. To bridge this gap, we present ChartGen, a fully-automated pipeline for code-guided synthetic chart generation. Starting from seed chart images, ChartGen (i) prompts a vision-language model (VLM) to reconstruct each image into a python script, and (ii) iteratively augments that script with a code-oriented large language model (LLM). Using ChartGen, we create 222.5K unique chart image-code pairs from 13K seed chart images, and present an open-source synthetic chart dataset covering 27 chart types, 11 plotting libraries, and multiple data modalities (image, code, text, CSV, DocTags). From this corpus, we curate a held-out chart-to-code evaluation subset of 4.3K chart image-code pairs, and evaluate six open-weight VLMs (3B - 26B parameters), highlighting substantial room for progress. We release the pipeline, prompts, and the dataset to help accelerate efforts towards robust chart understanding and vision-conditioned code generation: https://github.com/SD122025/ChartGen/
中文:本文提出了ChartGen,一个自动化生成图表-代码对的流程,旨在填补多模态基准在图表到代码重建任务上的空白,通过创建包含多种图表类型和绘图库的数据集,并评估了多个视觉语言模型的性能。
English: This paper introduces ChartGen, an automated pipeline that generates synthetic chart-image code pairs to advance chart-to-code reconstruction, addressing a gap in multimodal benchmarks by creating a comprehensive dataset and evaluating several vision-language models.
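
The two-step pipeline maps naturally onto a small driver script: a VLM call turns a seed chart into a plotting script, an LLM call produces variants, and only scripts that execute are kept. The model calls below are hypothetical placeholders (no specific API is implied); only the control flow mirrors the description above.

```python
import subprocess
import sys
import tempfile

def vlm_chart_to_code(image_path: str) -> str:
    """Hypothetical step (i): prompt a VLM to reconstruct the chart image as a
    matplotlib script (the model call itself is omitted)."""
    raise NotImplementedError

def llm_augment_code(script: str, instruction: str) -> str:
    """Hypothetical step (ii): ask a code-oriented LLM to vary chart type,
    style, or underlying data in the reconstructed script."""
    raise NotImplementedError

def render_ok(script: str) -> bool:
    """Execute a candidate plotting script; keep it only if it runs cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    return subprocess.run([sys.executable, path], capture_output=True).returncode == 0

def expand_seed(seed_image: str, instructions: list[str]) -> list[str]:
    """Grow one seed chart into several executable chart scripts."""
    base = vlm_chart_to_code(seed_image)
    variants = [llm_augment_code(base, instr) for instr in instructions]
    return [s for s in [base, *variants] if render_ok(s)]
```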

Authors:Byungjun Kim, Shunsuke Saito, Giljoo Nam, Tomas Simon, Jason Saragih, Hanbyul Joo, Junxuan Li
Title: HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
Abstract:
We present a universal prior model for 3D head avatars with explicit hair compositionality. Existing approaches to build generalizable priors for 3D head avatars often adopt a holistic modeling approach, treating the face and hair as an inseparable entity. This overlooks the inherent compositionality of the human head, making it difficult for the model to naturally disentangle face and hair representations, especially when the dataset is limited. Furthermore, such holistic models struggle to support applications like 3D face and hairstyle swapping in a flexible and controllable manner. To address these challenges, we introduce a prior model that explicitly accounts for the compositionality of face and hair, learning their latent spaces separately. A key enabler of this approach is our synthetic hairless data creation pipeline, which removes hair from studio-captured datasets using estimated hairless geometry and texture derived from a diffusion prior. By leveraging a paired dataset of hair and hairless captures, we train disentangled prior models for face and hair, incorporating compositionality as an inductive bias to facilitate effective separation. Our model's inherent compositionality enables seamless transfer of face and hair components between avatars while preserving identity. Additionally, we demonstrate that our model can be fine-tuned in a few-shot manner using monocular captures to create high-fidelity, hair-compositional 3D head avatars for unseen subjects. These capabilities highlight the practical applicability of our approach in real-world scenarios, paving the way for flexible and expressive 3D avatar generation.

Authors:Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang
Title: MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
Abstract:
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.
中文:MMBench-GUI是一个跨平台的GUI自动化代理分层评测基准,包含四个技能层级和创新的EQA效率评估指标,强调精准视觉定位、任务规划与效率优化对实现可靠自动化的重要作用。
English: MMBench-GUI is a cross-platform hierarchical benchmark for evaluating GUI automation agents, featuring four skill levels and a novel EQA metric to assess efficiency, while highlighting the critical roles of visual grounding, task planning, and efficiency optimization for reliable automation.
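
The abstract does not give the EQA formula, so the sketch below shows one plausible reading of an efficiency-quality area: integrate task quality over a normalized step budget, so an agent that reaches high quality early accumulates more area than one that needs many redundant steps. Treat it purely as an illustration of the idea.

```python
def efficiency_quality_area(step_quality: list[float], max_steps: int) -> float:
    """Illustrative efficiency-quality-area style score (an assumption, not the benchmark's formula).

    `step_quality[t]` is the task quality (e.g., 0/1 completion or a partial
    score) after step t; the score is the mean quality over a fixed step
    budget, with the final quality carried forward if the agent stops early.
    """
    if not step_quality:
        return 0.0
    area = sum(q / max_steps for q in step_quality)
    area += step_quality[-1] * max(0, max_steps - len(step_quality)) / max_steps
    return area

# An agent that finishes by step 3 scores higher than one finishing at step 8.
print(efficiency_quality_area([0, 0, 1, 1, 1, 1, 1, 1, 1, 1], max_steps=10))  # 0.8
print(efficiency_quality_area([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], max_steps=10))  # 0.3
```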

Authors:Chen Zhu, Wangbo Zhao, Huiwen Zhang, Samir Khaki, Yuhao Zhou, Weidong Tang, Shuo Wang, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Kai Wang, Dawei Yang
Title: EA-ViT: Efficient Adaptation for Elastic Vision Transformer
Abstract:
Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at https://github.com/zcxcf/EA-ViT.
中文摘要:本文提出EA-ViT框架,通过嵌套弹性架构和渐进式课程训练,仅需一次适配即可将预训练视觉Transformer转换为多种尺寸的部署模型,有效解决了传统方法需为不同资源限制重复训练多个模型的问题。
English Summary: The paper introduces EA-ViT, an efficient adaptation framework that transforms a single pre-trained Vision Transformer into multiple scalable models through a nested elastic architecture and curriculum training, eliminating the need for retraining separate models for different resource constraints.
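
One axis of the nested elastic architecture, the MLP expansion ratio, can be sketched by letting every sub-model reuse the leading slice of the full MLP weights, so a single set of parameters serves several compute budgets. The dimensions and ratios below are illustrative, not the paper's search space.

```python
import torch
import torch.nn as nn

class ElasticMLP(nn.Module):
    """Nested elastic MLP block (sketch of one elasticity axis).

    Sub-models reuse the leading rows/columns of the full weight matrices,
    so the same parameters can be run at several expansion ratios.
    """
    def __init__(self, dim: int = 384, max_ratio: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * max_ratio)
        self.fc2 = nn.Linear(dim * max_ratio, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, ratio: float = 4.0) -> torch.Tensor:
        hidden = int(x.shape[-1] * ratio)
        h = torch.nn.functional.linear(x, self.fc1.weight[:hidden], self.fc1.bias[:hidden])
        h = self.act(h)
        return torch.nn.functional.linear(h, self.fc2.weight[:, :hidden], self.fc2.bias)

if __name__ == "__main__":
    block = ElasticMLP(dim=384, max_ratio=4)
    tokens = torch.randn(2, 197, 384)
    for r in (1.0, 2.0, 4.0):             # same weights, three compute budgets
        print(r, block(tokens, ratio=r).shape)
```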

Authors:Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak
Title: SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
Abstract:
Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Built on this model, a second-stage module is applied to automatically generate gestures from speech, text-based semantics and speaker identity that ensures consistency between the semantic relevance of generated gestures and co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation in both objective and subjective metrics. The qualitative results of our model, code, dataset and pre-trained models can be viewed at https://semgesture.github.io/.

Authors:Muhammad Ibrahim, Naveed Akhtar, Haitian Wang, Saeed Anwar, Ajmal Mian
Title: Multistream Network for LiDAR and Camera-based 3D Object Detection in Outdoor Scenes
Abstract:
Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy, and it has started gaining traction for addressing real-world challenges in this task. However, effectively integrating these modalities for precise object detection remains a largely open problem. To address it, we propose a MultiStream Detection (MuStD) network that meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input, while the LiDAR-Height Compression stream computes Bird's-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Eventually, the features containing comprehensive spatial, textural, and geometric information are carefully fused and fed to a detection head for 3D object detection. Our extensive evaluation on the challenging KITTI Object Detection Benchmark, using the public testing server at https://www.cvlibs.net/datasets/kitti/eval_object_detail.php?&result=d162ec699d6992040e34314d19ab7f5c217075e0, establishes the efficacy of our method by achieving new state-of-the-art or highly competitive results in different categories while remaining among the most efficient methods. Our code will be released through the MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.git.
Chinese: 提出的多流检测(MuStD)网络通过三流架构有效融合激光雷达与RGB数据,在KITTI基准测试中实现了最优的三维物体检测性能,同时保持高效运行。
English: The proposed MultiStream Detection (MuStD) network effectively fuses LiDAR and RGB data through a three-stream architecture to achieve state-of-the-art 3D object detection performance on the KITTI benchmark while maintaining high efficiency.

Authors:Sakuya Ota, Qing Yu, Kent Fujiwara, Satoshi Ikehata, Ikuro Sato
Title: PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
Abstract:
Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions. In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into semantically relevant pairwise interactions, and leverages pretrained two-person interaction diffusion models to incrementally compose group interactions. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training. Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.
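
As a concrete illustration of the physics-based penalties mentioned above, the sketch below implements a differentiable penetration penalty between two characters, approximating each as spheres around joint positions. The radius, the squared-hinge form, and the joint count are assumptions for illustration; PINO's exact penalty terms are not specified in the abstract.

```python
# Toy penetration penalty for two characters' joint trajectories.
# The sphere radius and penalty form are illustrative assumptions.
import torch

def penetration_penalty(joints_a, joints_b, radius=0.15):
    """joints_*: (T, J, 3) trajectories; penalize joint pairs closer
    than 2*radius in the same frame."""
    diff = joints_a.unsqueeze(2) - joints_b.unsqueeze(1)  # (T, J, J, 3)
    dist = diff.norm(dim=-1)                              # (T, J, J)
    overlap = torch.clamp(2 * radius - dist, min=0.0)
    return (overlap ** 2).mean()

# In a noise-optimization loop, a term like this would be added to the
# guidance objective and differentiated w.r.t. the diffusion latents.
a = torch.randn(60, 22, 3, requires_grad=True)
b = torch.randn(60, 22, 3)
loss = penetration_penalty(a, b)
loss.backward()
print(loss.item(), a.grad.shape)
```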

Authors:Guoping Xu, Yan Dai, Hengrui Zhao, Ying Zhang, Jie Deng, Weiguo Lu, You Zhang
Title: SAM2-Aug: Prior knowledge-based Augmentation for Target Volume Auto-Segmentation in Adaptive Radiation Therapy Using Segment Anything Model 2
Abstract:
Purpose: Accurate tumor segmentation is vital for adaptive radiation therapy (ART) but remains time-consuming and user-dependent. Segment Anything Model 2 (SAM2) shows promise for prompt-based segmentation but struggles with tumor accuracy. We propose prior knowledge-based augmentation strategies to enhance SAM2 for ART. Methods: Two strategies were introduced to improve SAM2: (1) using prior MR images and annotations as contextual inputs, and (2) improving prompt robustness via random bounding box expansion and mask erosion/dilation. The resulting model, SAM2-Aug, was fine-tuned and tested on the One-Seq-Liver dataset (115 MRIs from 31 liver cancer patients), and evaluated without retraining on Mix-Seq-Abdomen (88 MRIs, 28 patients) and Mix-Seq-Brain (86 MRIs, 37 patients). Results: SAM2-Aug outperformed convolutional, transformer-based, and prompt-driven models across all datasets, achieving Dice scores of 0.86 (liver), 0.89 (abdomen), and 0.90 (brain). It demonstrated strong generalization across tumor types and imaging sequences, with improved performance in boundary-sensitive metrics. Conclusions: Incorporating prior images and enhancing prompt diversity significantly boosts segmentation accuracy and generalizability. SAM2-Aug offers a robust, efficient solution for tumor segmentation in ART. Code and models will be released at https://github.com/apple1986/SAM2-Aug.
中文摘要:本研究通过整合先验知识和增强提示多样性,提升了Segment Anything Model 2(SAM2)在自适应放疗肿瘤分割中的性能,使其在多个数据集上展现出卓越的准确性和泛化能力。
English Summary: The study enhances the Segment Anything Model 2 (SAM2) for tumor segmentation in adaptive radiation therapy by incorporating prior knowledge and improving prompt robustness, resulting in superior performance across diverse datasets.
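
The two prompt-robustness augmentations named in the abstract, random bounding-box expansion and mask erosion/dilation, are simple to sketch. The ranges and structuring choices below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of prompt augmentation: random bbox expansion and random
# erosion/dilation of a binary mask prompt. Parameters are assumptions.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def expand_bbox(bbox, img_shape, max_expand=10):
    """bbox = (x0, y0, x1, y1); grow each side by a random pixel amount."""
    x0, y0, x1, y1 = bbox
    h, w = img_shape
    dx0, dy0, dx1, dy1 = rng.integers(0, max_expand + 1, size=4)
    return (max(0, x0 - dx0), max(0, y0 - dy0),
            min(w, x1 + dx1), min(h, y1 + dy1))

def perturb_mask(mask, max_iter=3):
    """Randomly erode or dilate a binary mask prompt."""
    n = int(rng.integers(1, max_iter + 1))
    if rng.random() < 0.5:
        return ndimage.binary_erosion(mask, iterations=n)
    return ndimage.binary_dilation(mask, iterations=n)

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 25:45] = True
print(expand_bbox((25, 20, 45, 40), mask.shape), int(perturb_mask(mask).sum()))
```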

Authors:An Xiang, Zixuan Huang, Xitong Gao, Kejiang Ye, Cheng-zhong Xu
Title: BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection
Abstract:
Industrial anomaly detection (AD) for 2D objects has gained significant attention, and AD methods have achieved notable progress. However, identifying 3D depth anomalies using only 2D information is insufficient. Whether depth information is explicitly fused into RGB images or extracted with point cloud backbone networks, both approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modalities. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of three key aspects. (1) We extract visible depth information from 3D point cloud data in a simple manner and use 2D RGB images to represent appearance, disentangling depth from appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in both RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) methods on the MVTec-3D AD and Eyecandies datasets. Code available at: https://github.com/Xantastic/BridgeNet
中文: 本文提出了一种统一的多模态异常检测框架,通过分离深度与外观特征实现高效异常生成,在三维工业数据集上超越了现有最优方法。
English: This paper introduces a unified multimodal anomaly detection framework that disentangles depth and appearance features to enable effective anomaly generation and outperforms state-of-the-art methods on 3D industrial datasets.

Authors:Jiaru Zhong, Jiahao Wang, Jiahui Xu, Xiaofan Li, Zaiqing Nie, Haibao Yu
Title: CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Abstract:
Cooperative perception aims to address the inherent limitations of single-vehicle autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises two key components: Multi-Dimensional Feature Extraction, and Cross-Agent Association and Aggregation, which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on a feature graph. Experiments on both the V2X-Seq and Griffin datasets demonstrate that CoopTrack achieves excellent performance. Specifically, it attains state-of-the-art results on V2X-Seq, with 39.0% mAP and 32.8% AMOTA. The project is available at https://github.com/zhongjiaru/CoopTrack.
中文: 协作感知通过多智能体信息共享提升自动驾驶能力,CoopTrack提出创新的实例级框架实现高效追踪,在基准测试中达到顶尖性能。
English: Cooperative perception enhances autonomous driving by enabling multi-agent information sharing, with CoopTrack introducing an innovative instance-level framework for efficient tracking that achieves top performance on benchmark datasets.

Authors:Donggeun Lim, Jinseok Bae, Inwoo Hwang, Seungmin Lee, Hwanhee Lee, Young Min Kim
Title: Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene
Abstract:
In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We adapt the power of a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems, allowing us to generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability. The code and benchmark, along with result videos, are available at our project page: https://rms0329.github.io/Event-Driven-Storytelling/.

Authors:Hanbing Wu, Ping Jiang, Anyang Su, Chenxu Zhao, Tianyu Fu, Minghui Wu, Beiping Tan, Huiying Li
Title: PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction
Abstract:
Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolution, which limits their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucination, making it costly for them to strictly adhere to the expected format in tasks involving multiple point predictions, and precise point positioning remains challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors of over 4,500 participants, varying in age and gender, across 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye-movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at https://github.com/mininglamp-MLLM/PRE-MAP.
中文: 本研究提出了SPA-ADV大规模多模态数据集以捕捉多样化注视行为,并开发了PRE-MAP模型,该模型通过强化学习优化,利用多属性用户画像预测个性化视觉注意点,同时采用创新优化方法解决多模态大模型在格式规范与定位精度方面的挑战。
English: This research introduces SPA-ADV, a large-scale multimodal dataset capturing diverse gaze behaviors, and proposes PRE-MAP, a reinforcement learning-optimized model that leverages multi-attribute profiles to predict personalized visual attention points while addressing format and accuracy challenges in MLLMs through innovative optimization techniques.
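
C-GRPO builds on Group Relative Policy Optimization, in which each sampled response is scored against its own sampling group rather than a learned critic. The snippet below shows only that generic group-relative advantage step as a hedged sketch; the consistency signal that the "C" adds is not detailed in the abstract and is not implemented here.

```python
# Generic GRPO-style advantage: normalize each response's reward by the
# mean and std of its sampling group. C-GRPO's consistency term is omitted.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: (G,) scalar rewards for G responses to one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantages([0.2, 0.8, 0.5, 0.5]))
```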

Authors:Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang
Title: Joint Holistic and Lesion Controllable Mammogram Synthesis via Gated Conditional Diffusion Model
Abstract:
Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at https://github.com/lixinHUST/Gated-Conditional-Diffusion-Model/
Chinese Summary: 门控条件扩散模型(GCDM)是一种创新框架,通过潜在去噪过程结合软掩码嵌入和门控调节分支,能够同时合成完整的乳腺X光图像和局部病灶,有效增强病灶特征与周围组织的解剖一致性。
English Summary: The Gated Conditional Diffusion Model (GCDM) is a novel framework that synthesizes both complete mammogram images and localized lesions by using a latent denoising process with soft mask embeddings and a gated conditioning branch to enhance lesion-specific features and anatomical coherence.

Authors:Kang Wang, Chen Qin, Zhang Shi, Haoran Wang, Xiwen Zhang, Chen Chen, Cheng Ouyang, Chengliang Dai, Yuanhan Mo, Chenchen Dai, Xutong Kuang, Ruizhe Li, Xin Chen, Xiuzheng Yue, Song Tian, Alejandro Mora-Rubio, Kumaradevan Punithakumar, Shizhan Gong, Qi Dou, Sina Amirrajab, Yasmina Al Khalil, Cian M. Scannell, Lexiaozi Fan, Huili Yang, Xiaowu Sun, Rob van der Geest, Tewodros Weldebirhan Arega, Fabrice Meriaudeau, Caner Özer, Amin Ranem, John Kalkhof, İlkay Öksüz, Anirban Mukhopadhyay, Abdul Qayyum, Moona Mazher, Steven A Niederer, Carles Garcia-Cabrera, Eric Arazo, Michal K. Grzeszczyk, Szymon Płotka, Wanqin Ma, Xiaomeng Li, Rongjun Ge, Yongqing Kou, Xinrong Chen, He Wang, Chengyan Wang, Wenjia Bai, Shuo Wang
Title: Extreme Cardiac MRI Analysis under Respiratory Motion: Results of the CMRxMotion Challenge
Abstract:
Deep learning models have achieved state-of-the-art performance in automated Cardiac Magnetic Resonance (CMR) analysis. However, the efficacy of these models is highly dependent on the availability of high-quality, artifact-free images. In clinical practice, CMR acquisitions are frequently degraded by respiratory motion, yet the robustness of deep learning models against such artifacts remains an underexplored problem. To promote research in this domain, we organized the MICCAI CMRxMotion challenge. We curated and publicly released a dataset of 320 CMR cine series from 40 healthy volunteers who performed specific breathing protocols to induce a controlled spectrum of motion artifacts. The challenge comprised two tasks: 1) automated image quality assessment to classify images based on motion severity, and 2) robust myocardial segmentation in the presence of motion artifacts. A total of 22 algorithms were submitted and evaluated on the two designated tasks. This paper presents a comprehensive overview of the challenge design and dataset, reports the evaluation results for the top-performing methods, and further investigates the impact of motion artifacts on five clinically relevant biomarkers. All resources and code are publicly available at: https://github.com/CMRxMotion
中文:MICCAI CMRxMotion挑战赛通过发布包含受控运动伪影的心脏MRI数据集,评估了22种算法在图像质量分级与分割任务中的表现,旨在推动深度学习模型对呼吸运动伪影的鲁棒性研究,所有资源均已开源。
English: The MICCAI CMRxMotion challenge addressed the underexplored issue of deep learning robustness against respiratory motion artifacts in cardiac MRI by releasing a curated dataset and evaluating 22 algorithms on quality assessment and segmentation tasks, with all resources made publicly available.

Authors:Jie Chen, Zhangchi Hu, Peixi Wu, Huyue Zhu, Hebei Li, Xiaoyan Sun
Title: DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
Abstract:
Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.
中文: DASH提出了一种结合4D哈希编码与自监督分解的实时动态场景渲染框架,有效解决了特征重叠和哈希冲突问题,在264 FPS下实现了最优的视觉质量。
English: DASH introduces a real-time dynamic scene rendering framework using 4D hash encoding with self-supervised decomposition to overcome feature overlap and hash collisions, achieving state-of-the-art visual quality at 264 FPS.
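
A rough sense of what a multiresolution 4D hash encoder does: a spatio-temporal coordinate (x, y, z, t) is mapped, at several grid resolutions, to learnable feature vectors stored in fixed-size hash tables. The sketch below uses nearest-vertex lookup and a standard prime-multiply XOR hash for brevity; DASH's actual encoder, cell interpolation, and table sizes are not specified in the abstract, so treat every constant here as an assumption.

```python
# Toy multiresolution 4D hash encoding (nearest-vertex lookup, no
# interpolation). Table sizes, primes, and level count are illustrative.
import torch

PRIMES = (1, 2654435761, 805459861, 3674653429)

class Hash4D(torch.nn.Module):
    def __init__(self, levels=4, table_size=2**14, feat_dim=2, base_res=8):
        super().__init__()
        self.res = [base_res * 2**i for i in range(levels)]
        self.table_size = table_size
        self.tables = torch.nn.ParameterList(
            [torch.nn.Parameter(1e-4 * torch.randn(table_size, feat_dim))
             for _ in range(levels)]
        )

    def forward(self, xyzt):  # xyzt in [0, 1]^4, shape (N, 4)
        feats = []
        for res, table in zip(self.res, self.tables):
            idx = (xyzt * res).long()          # nearest 4D grid vertex
            h = idx[..., 0] * PRIMES[0]        # prime-multiply XOR hash
            for d in range(1, 4):
                h = h ^ (idx[..., d] * PRIMES[d])
            feats.append(table[h % self.table_size])
        return torch.cat(feats, dim=-1)        # concat features across levels

enc = Hash4D()
print(enc(torch.rand(5, 4)).shape)  # torch.Size([5, 8])
```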

Authors:Tianyu Zou, Shengwu Xiong, Ruilin Yao, Yi Rong
Title: Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation
Abstract:
This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a Prototype-Affinity Hybrid Network (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet outperforms most recently proposed methods across 1-shot and 5-shot settings on both PASCAL-5^i and COCO-20^i datasets, suggesting its effectiveness. The code is available at: https://github.com/tianyu-zou/PAHNet
中文: 本文提出PAHNet混合网络,通过特征增强和注意力校准模块平衡保守的原型学习与激进的亲和力学习,从而提升小样本分割的准确性。
English: This paper introduces PAHNet, a hybrid few-shot segmentation model that balances conservative prototype learning and aggressive affinity learning through feature enhancement and attention calibration modules to improve segmentation accuracy.

Authors:Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, Bowen Zhou
Title: OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
Abstract:
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal heterogeneity of tasks and the corresponding agent capabilities, as well as their alignment with actual user demands, hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. Evaluating agents along these two dimensions enables fine-grained analysis of required capabilities and alignment with real-world scenarios, capturing varying levels of required agent autonomy and generalization and forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agent research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
中文: 计算机使用代理虽能提升生产力,但现有基准未能匹配实际需求,因此我们推出OS-MAP基准,通过自动化分级和泛化范围评估代理能力,揭示其在高阶任务中的不足,以推动研究与应用发展。
English: Computer-using agents show promise for productivity but face challenges in aligning capabilities with real-world tasks, prompting the introduction of OS-MAP, a benchmark that evaluates agents across automation levels and generalization scopes to address these gaps and guide future development.

Authors:Yuqi Li, Haotian Zhang, Li Li, Dong Liu
Title: Learned Image Compression with Hierarchical Progressive Context Modeling
Abstract:
Context modeling is essential in learned image compression for accurately estimating the distribution of latents. While recent advanced methods have expanded context modeling capacity, they still struggle to efficiently exploit long-range dependency and diverse context information across different coding steps. In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. Specifically, HPCM employs a hierarchical coding schedule to sequentially model the contextual dependencies among latents at multiple scales, which enables more efficient long-range context modeling. Furthermore, we propose a progressive context fusion mechanism that incorporates contextual information from previous coding steps into the current step, effectively exploiting diverse contextual information. Experimental results demonstrate that our method achieves state-of-the-art rate-distortion performance and strikes a better balance between compression performance and computational complexity. The code is available at https://github.com/lyq133/LIC-HPCM.
Chinese: 本文提出的分层渐进上下文模型(HPCM)通过多尺度顺序建模和渐进融合机制,有效利用长程依赖和多样化上下文信息,在图像压缩中实现了领先的率失真性能和更优的计算复杂度平衡。
English: The paper introduces a Hierarchical Progressive Context Model (HPCM) that enhances image compression by efficiently capturing long-range dependencies and integrating contextual information across coding steps, achieving superior rate-distortion performance and computational balance.

Authors:Shuhao Li, Weidong Yang, Yue Cui, Xiaoxing Liu, Lingkai Meng, Lipeng Ma, Fan Zhang
Title: Fine-Grained Traffic Inference from Road to Lane via Spatio-Temporal Graph Node Generation
Abstract:
Fine-grained traffic management and prediction are fundamental to key applications such as autonomous driving, lane change guidance, and traffic signal control. However, obtaining lane-level traffic data has become a critical bottleneck for data-driven models due to limitations in the types and number of sensors and issues with the accuracy of tracking algorithms. To address this, we propose the Fine-grained Road Traffic Inference (FRTI) task, which aims to generate more detailed lane-level traffic information using limited road data, providing a more energy-efficient and cost-effective solution for precise traffic management. This task is abstracted as the first scene of the spatio-temporal graph node generation problem. We design a two-stage framework, RoadDiff, to solve the FRTI task. The framework leverages the Road-Lane Correlation Autoencoder-Decoder and the Lane Diffusion Module to fully utilize the limited spatio-temporal dependencies and distribution relationships of road data to accurately infer fine-grained lane traffic states. Based on existing research, we design several baseline models with the potential to solve the FRTI task and conduct extensive experiments on six datasets representing different road conditions to validate the effectiveness of the RoadDiff model in addressing the FRTI task. The relevant datasets and code are available at https://github.com/ShuhaoLii/RoadDiff.
Chinese: 本研究提出了细粒度道路交通推断(FRTI)任务,并设计了两阶段RoadDiff框架,通过利用有限道路数据高效生成详细车道级交通信息,在六个数据集上的大量实验验证了其有效性。
English: The study introduces the Fine-grained Road Traffic Inference (FRTI) task and proposes a two-stage RoadDiff framework to generate detailed lane-level traffic data efficiently using limited road information, validated through extensive experiments on six datasets.

Authors:Rui Pan, Ruiying Lu
Title: SP-Mamba: Spatial-Perception State Space Model for Unsupervised Medical Anomaly Detection
Abstract:
Radiography imaging protocols target specific anatomical regions, resulting in highly consistent images with recurrent structural patterns across patients. Recent advances in medical anomaly detection have demonstrated the effectiveness of CNN- and transformer-based approaches. However, CNNs exhibit limitations in capturing long-range dependencies, while transformers suffer from quadratic computational complexity. In contrast, Mamba-based models, leveraging superior long-range modeling, structural feature extraction, and linear computational efficiency, have emerged as a promising alternative. To capitalize on the inherent structural regularity of medical images, this study introduces SP-Mamba, a spatial-perception Mamba framework for unsupervised medical anomaly detection. Window-sliding prototype learning and a Circular-Hilbert scanning-based Mamba are introduced to better exploit consistent anatomical patterns and leverage spatial information for medical anomaly detection. Furthermore, we exploit the concentration and contrast characteristics of anomaly maps to improve anomaly detection. Extensive experiments on three diverse medical anomaly detection benchmarks confirm the proposed method's state-of-the-art performance, validating its efficacy and robustness. The code is available at https://github.com/Ray-RuiPan/SP-Mamba.
中文摘要:本研究提出SP-Mamba空间感知框架,通过利用医学图像的结构规律性和空间信息进行无监督异常检测,在多个基准测试中实现了最先进的性能。
English Summary: This study introduces SP-Mamba, a spatial-perception Mamba framework that leverages structural regularity and spatial information for unsupervised medical anomaly detection, achieving state-of-the-art performance across multiple benchmarks.

Authors:Shuiqing Zhao, Meihuan Wang, Jiaxuan Xu, Jie Feng, Wei Qian, Rongchang Chen, Zhenyu Liang, Shouliang Qi, Yanan Wu
Title: A Self-training Framework for Semi-supervised Pulmonary Vessel Segmentation and Its Application in COPD
Abstract:
Background: Accurate segmentation and quantification of pulmonary vessels, particularly smaller vessels, from computed tomography (CT) images is fundamental in patients with chronic obstructive pulmonary disease (COPD). Objective: The aim of this study was to segment the pulmonary vasculature using a semi-supervised method. Methods: In this study, a self-training framework is proposed by leveraging a teacher-student model for the segmentation of pulmonary vessels. First, high-quality annotations are acquired for the in-house data in an interactive way. Then, the model is trained in a semi-supervised way. A fully supervised model is trained on a small set of labeled CT images, yielding the teacher model. Following this, the teacher model is used to generate pseudo-labels for the unlabeled CT images, from which reliable ones are selected based on a defined strategy. The training of the student model involves these reliable pseudo-labels. This training process is repeated iteratively until optimal performance is achieved. Results: Extensive experiments are performed on non-enhanced CT scans of 125 COPD patients. Quantitative and qualitative analyses demonstrate that the proposed method, Semi2, significantly improves the precision of vessel segmentation by 2.3%, achieving a precision of 90.3%. Further, quantitative analysis is conducted on the pulmonary vessels in COPD, providing insights into how the pulmonary vessels differ across disease severities. Conclusion: The proposed method not only improves the performance of pulmonary vascular segmentation but can also be applied in COPD analysis. The code will be made available at https://github.com/wuyanan513/semi-supervised-learning-for-vessel-segmentation.
中文: 本研究提出一种名为Semi2的半监督师生模型,将COPD患者肺血管分割精度提升2.3%至90.3%,并实现了疾病严重程度的血管差异分析。
English: This study introduces a semi-supervised teacher-student model called Semi2, which improves pulmonary vessel segmentation precision by 2.3% to 90.3% in COPD patients and enables disease severity analysis.
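
The self-training loop itself is compact enough to sketch. Below, `train`, `predict`, and `confidence` are placeholder callables, and the fixed confidence threshold stands in for whatever reliability strategy the paper actually uses.

```python
# Hedged sketch of teacher-student self-training with pseudo-label
# selection; the selection rule and round count are assumptions.
def self_train(labeled, unlabeled, train, predict, confidence,
               rounds=3, threshold=0.9):
    teacher = train(labeled)                   # fully supervised teacher
    for _ in range(rounds):
        pseudo = []
        for image in unlabeled:
            mask = predict(teacher, image)
            if confidence(mask) >= threshold:  # keep reliable pseudo-labels
                pseudo.append((image, mask))
        teacher = train(labeled + pseudo)      # student becomes next teacher
    return teacher
```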

Authors:Haochen Han, Alex Jinpeng Wang, Fangming Liu, Jun Zhu
Title: Negation-Aware Test-Time Adaptation for Vision-Language Models
Abstract:
In this paper, we study a practical but less-touched problem in Vision-Language Models (VLMs), i.e., negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, e.g., radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation: they fail to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Undoubtedly, such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on various negation understanding tasks verify the effectiveness of the proposed method. Remarkably, with less than 0.01% of trainable parameters, NEAT achieves comparable or superior performance to state-of-the-art post-training approaches. Our code is available at https://github.com/hhc1997/NEAT.
Chinese: 本文提出NEAT方法,通过推理时调整分布相关参数,以低碳方式解决视觉语言模型中的否定理解问题,仅用极少可训练参数即达到最优性能。
English: This paper introduces NEAT, a low-carbon test-time adaptation method that efficiently addresses negation understanding in Vision-Language Models by adjusting distribution-related parameters during inference, achieving state-of-the-art performance with minimal computational resources.

Authors:Chong Xia, Shengjun Zhang, Fangfu Liu, Chang Liu, Khodchaphun Hirunyaratsameewong, Yueqi Duan
Title: ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
Abstract:
Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable for long-term video synthesis and 3D scene reconstruction. Existing methods follow a "navigate-and-imagine" fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from semantic drift issue derived from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter's scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter for consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences. Project Page: https://xiac20.github.io/ScenePainter/.
中文: ScenePainter通过引入层次化场景概念图来对齐场景特定先验与当前场景理解,有效克服语义漂移问题,生成连贯且沉浸式的3D视图序列。
English: ScenePainter introduces a hierarchical SceneConceptGraph to align scene-specific priors with current scene comprehension, effectively overcoming semantic drift and generating consistent, immersive 3D view sequences.

Authors:Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, Serena Yeung-Levy
Title: Closing the Modality Gap for Mixed Modality Search
Abstract:
Mixed modality search, i.e., retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents, is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench, the first benchmark specifically designed for mixed modality search, GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.
中文: 本研究揭示了CLIP模型在混合模态检索中存在显著的模态鸿沟问题,并提出轻量级校准方法GR-CLIP,该方法在极大降低计算成本的同时显著提升了检索精度。
English: This study identifies a significant modality gap in CLIP models that hinders mixed modality search performance, and introduces GR-CLIP, a lightweight calibration method that substantially improves retrieval accuracy while drastically reducing computational costs.
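
The abstract does not spell out GR-CLIP's calibration. One common post-hoc way to shrink the modality gap, shown here purely as an assumption-laden sketch, is to subtract each modality's mean embedding and re-normalize, so image and text vectors share a common centroid before retrieval.

```python
# Post-hoc modality-gap reduction by per-modality mean-centering.
# This is a generic technique, not necessarily GR-CLIP's method.
import numpy as np

def center_modalities(img_emb, txt_emb):
    """img_emb: (N, D), txt_emb: (M, D) L2-normalized embeddings."""
    img_c = img_emb - img_emb.mean(axis=0, keepdims=True)
    txt_c = txt_emb - txt_emb.mean(axis=0, keepdims=True)
    img_c /= np.linalg.norm(img_c, axis=1, keepdims=True)
    txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True)
    return img_c, txt_c

rng = np.random.default_rng(0)
img = rng.normal(size=(100, 512)) + 0.5   # shared offset mimics the gap
txt = rng.normal(size=(80, 512)) - 0.5
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
img_c, txt_c = center_modalities(img, txt)
print(np.linalg.norm(img.mean(0) - txt.mean(0)),      # large gap before
      np.linalg.norm(img_c.mean(0) - txt_c.mean(0)))  # near zero after
```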

Authors:Yufei Ma, Hanwen Zhang, Qiya Yang, Guibo Luo, Yuesheng Zhu
Title: A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation
Abstract:
In multi-center scenarios, One-Shot Federated Learning (OSFL) has attracted increasing attention due to its low communication overhead, requiring only a single round of transmission. However, existing generative model-based OSFL methods suffer from low training efficiency and potential privacy leakage in the healthcare domain. Additionally, achieving convergence within a single round of model aggregation is challenging under non-Independent and Identically Distributed (non-IID) data. To address these challenges, this paper proposes a modified OSFL framework in which a new Feature-Guided Rectified Flow Model (FG-RF) and a Dual-Layer Knowledge Distillation (DLKD) aggregation method are developed. FG-RF on the client side accelerates generative modeling in medical imaging scenarios while preserving privacy by synthesizing feature-level images rather than pixel-level images. To handle non-IID distributions, DLKD enables the global student model to simultaneously mimic the output logits and align the intermediate-layer features of client-side teacher models during aggregation. Experimental results on three non-IID medical imaging datasets show that our new framework and method outperform multi-round federated learning approaches, achieving up to 21.73% improvement, and exceed the baseline FedISCA by an average of 21.75%. Furthermore, our experiments demonstrate that feature-level synthetic images significantly reduce privacy leakage risks compared to pixel-level synthetic images. The code is available at https://github.com/LMIAPC/one-shot-fl-medical.
Chinese: 本文提出了一种改进的单次联邦学习框架,采用特征引导整流流模型实现高效且保护隐私的医学图像生成,并通过双层知识蒸馏方法处理非独立同分布数据,在三个医学影像数据集上相比现有方法性能提升最高达21.73%。
English: This paper introduces a modified one-shot federated learning framework featuring a Feature-Guided Rectified Flow Model for efficient, privacy-preserving medical image generation and a Dual-Layer Knowledge Distillation method to handle non-IID data, achieving superior performance with up to 21.73% improvement over existing approaches.

Authors:Ying Ba, Tianyu Zhang, Yalong Bai, Wenyi Mo, Tao Liang, Bing Su, Ji-Rong Wen
Title: Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
Abstract:
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.
中文: 针对现有图像生成系统评估方法的不足,本研究提出了ICT和HP评分模型,通过优化图文对齐和美学质量评估,显著提升了评分准确性并推动图像生成技术向更高审美层次发展。
English: Current image generation systems produce high-quality images but are poorly evaluated by outdated metrics, leading this study to introduce the ICT and HP scores for better alignment with human aesthetic preferences and improved evaluation accuracy.

Authors:Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Shinichiro Omachi
Title: GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution
Abstract:
Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at https://github.com/yongsongH/GPSMamba.
中文: 提出的GPSMamba框架通过融合自适应语义频率提示和热光谱监督,克服了红外图像超分辨率中一维因果建模的局限,实现了通过增强全局一致性和结构保真度的最先进性能。
English: The proposed GPSMamba framework overcomes the limitations of 1D causal modeling in infrared image super-resolution by integrating adaptive semantic-frequency prompts and thermal-spectral supervision, achieving state-of-the-art performance through enhanced global coherence and structural fidelity.

Authors:Zixiang Ai, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou
Title: UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
Abstract:
Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness typically suffers from low-quality point clouds (i.e., noise and incompleteness), a common issue in real scenarios caused by casual object occlusions and unsatisfactory data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, because point cloud enhancement is isolated from the downstream tasks, these methods fail to work across various real-world domains. In addition, the conflicting objectives of the denoising and completion tasks further limit the ability of such ensemble paradigms to preserve critical geometric features. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter that adapts to noisy points through predicted rectification vector prompts, effectively filtering noise while preserving the intricate geometric features essential for accurate analysis. Subsequently, we incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, improving their robustness and adaptability. Finally, a Shape-Aware Unit module is exploited to efficiently unify and capture the filtered geometric features for downstream point cloud analysis. Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code is released at https://github.com/zhoujiahuan1991/ICCV2025-UPP.
中文: 本文提出了一种统一的点级提示方法,通过整合校正和补全提示器来高效处理噪声和不完整数据,从而增强预训练点云分析的鲁棒性,并在多个数据集上展现出卓越性能。
English: This paper introduces a unified point-level prompting method that enhances pre-trained point cloud analysis by integrating rectification and completion prompters to handle noisy and incomplete data efficiently, demonstrating superior performance across multiple datasets.
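
To make the "rectification vector prompt" idea tangible: a small network predicts a per-point offset that denoises the cloud in place, and the offsets themselves serve as prompts. The PointNet-style MLP below is an illustrative stand-in, not the paper's module.

```python
# Sketch of a rectification prompter: predict per-point offsets that
# denoise the cloud; architecture and sizes are assumptions.
import torch
import torch.nn as nn

class RectificationPrompter(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 3),
        )

    def forward(self, points):            # points: (B, N, 3), possibly noisy
        offsets = self.mlp(points)        # per-point rectification vectors
        return points + offsets, offsets  # rectified cloud + vector prompts

noisy = torch.randn(2, 1024, 3)
rectified, prompts = RectificationPrompter()(noisy)
print(rectified.shape, prompts.shape)
```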

Authors:Jionghao Wang, Cheng Lin, Yuan Liu, Rui Xu, Zhiyang Dou, Xiao-Xiao Long, Hao-Xiang Guo, Taku Komura, Wenping Wang, Xin Li
Title: PDT: Point Distribution Transformation with Diffusion Models
Abstract:
Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods typically leverage the unordered and unconstrained nature to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with a novel architecture and learning strategy, which effectively correlate the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs, ranging from surface-aligned keypoints and inner sparse joints to continuous feature lines. The results showcase our framework's ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be available at this link: https://github.com/shanemankiw/PDT.
中文: PDT框架利用扩散模型将无结构的点云转换为具有语义意义的分布,有效捕捉几何和结构特征,以提升三维几何处理能力。
English: The PDT framework employs diffusion models to transform unstructured point clouds into semantically meaningful distributions, effectively capturing both geometric and structural features for enhanced 3D geometry processing.

Authors:Jian Chen, Yuxuan Hu, Haifeng Lu, Wei Wang, Min Yang, Chengming Li, Xiping Hu
Title: MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition
Abstract:
Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on two public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, MGHFT also obtains clear improvements of 5.4% in F1 and 4.0% in accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.
中文摘要:提出的多粒度分层融合Transformer(MGHFT)通过多模态大语言模型实现表情包的多视角解读,并采用分层策略将文本信息与多阶段视觉特征融合,在表情情感识别任务中显著优于现有方法。
English Summary: The proposed Multi-Granularity Hierarchical Fusion Transformer (MGHFT) leverages multimodal large language models to interpret stickers through multi-view descriptions and hierarchically fuses textual context with visual features, achieving superior performance in sticker emotion recognition compared to existing methods.

Authors:Zhihao Luo, Luojun Lin, Zheng Lin
Title: Synthetic-to-Real Camouflaged Object Detection
Abstract:
Due to the high cost of collection and labeling, there are relatively few datasets for camouflaged object detection (COD). In particular, for certain specialized categories, the available image datasets are insufficiently populated. Synthetic datasets can be utilized to alleviate the problem of limited data to some extent. However, training directly on synthetic datasets rather than real ones can degrade model performance. To tackle this problem, in this work, we investigate a new task, namely Syn-to-Real Camouflaged Object Detection (S2R-COD). In order to improve model performance in real-world scenarios, a set of annotated synthetic camouflaged images and a limited number of unannotated real images must be utilized. We propose the Cycling Syn-to-Real Domain Adaptation Framework (CSRDA), a method based on the student-teacher model. Specifically, CSRDA propagates class information from the labeled source domain to the unlabeled target domain through pseudo labeling combined with consistency regularization. Considering that narrowing the intra-domain gap can improve the quality of pseudo labeling, CSRDA utilizes a recurrent learning framework to build an evolving real domain that bridges the source and target domains. Extensive experiments demonstrate the effectiveness of our framework, mitigating the problems of limited data and handcrafted annotations in COD. Our code is publicly available at: https://github.com/Muscape/S2R-COD.
中文: 本研究提出合成到真实伪装目标检测(S2R-COD)新任务,并开发了CSRDA框架,通过域适应方法结合合成数据与未标注真实图像,显著提升了模型在真实场景中的性能,有效缓解了伪装检测领域数据匮乏的问题。
English: This study introduces the Syn-to-Real Camouflaged Object Detection (S2R-COD) task and proposes the CSRDA framework, which uses synthetic data and unlabeled real images with domain adaptation to enhance model performance in real-world scenarios, effectively addressing data scarcity in COD.

Authors:Beidi Zhao, SangMook Kim, Hao Chen, Chen Zhou, Zu-hua Gao, Gang Wang, Xiaoxiao Li
Title: PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis
Abstract:
Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.
中文: PTCMIL提出了一种基于提示令牌聚类的视觉变换器,通过动态对齐聚类与下游任务,在降低计算复杂度的同时有效捕获任务相关模式,在WSI分析中展现出卓越性能。
English: PTCMIL introduces a prompt token clustering-based Vision Transformer that dynamically aligns clustering with downstream tasks, achieving superior performance in WSI analysis through efficient task-relevant pattern capture and reduced computational complexity.

Authors:Fabio De Sousa Ribeiro, Omar Todd, Charles Jones, Avinash Kori, Raghav Mehta, Ben Glocker
Title: Flow Stochastic Segmentation Networks
Abstract:
We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, thanks to most of the model capacity being allocated to learning the base distribution of the flow, constituting an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results. Code available: https://github.com/biomedia-mira/flow-ssn.
中文摘要:Flow-SSN模型系列通过离散和连续时间变体克服了先前方法的秩限制,同时借助表达能力强的先验分布实现高效采样,在医学影像基准测试中取得了最先进的结果。
English Summary: The Flow-SSN model family introduces generative segmentation with discrete and continuous-time variants that overcome previous methods' rank limitations while enabling efficient sampling through an expressive prior, achieving state-of-the-art results on medical imaging benchmarks.

Authors:Miguel Saavedra-Ruiz, Samer B. Nashed, Charlie Gauthier, Liam Paull
Title: Perpetua: Multi-Hypothesis Persistence Modeling for Semi-Static Environments
Abstract:
Many robotic systems require extended deployments in complex, dynamic environments. In such deployments, parts of the environment may change between subsequent robot observations. Most robotic mapping or environment modeling algorithms are incapable of representing dynamic features in a way that enables predicting their future state. Instead, they opt to filter certain state observations, either by removing them or by some form of weighted averaging. This paper introduces Perpetua, a method for modeling the dynamics of semi-static features. Perpetua is able to incorporate prior knowledge about a feature's dynamics when it exists, track multiple hypotheses, and adapt over time to enable prediction of future feature states. Specifically, we chain together mixtures of "persistence" and "emergence" filters to model the probability that features will disappear or reappear in a formal Bayesian framework. The approach is an efficient, scalable, general, and robust method for estimating the states of features in an environment, both at present and at arbitrary future times. Through experiments on simulated and real-world data, we find that Perpetua yields better accuracy than similar approaches while also being online adaptable and robust to missing observations.
中文: 本文提出Perpetua方法,通过串联持续性与出现性滤波器对半静态环境特征进行贝叶斯动态建模,实验证明该方法在预测未来状态时具有更高准确性、在线适应性和鲁棒性。
English: This paper introduces Perpetua, a Bayesian method for modeling semi-static environmental features by chaining persistence and emergence filters to predict their future states, demonstrating superior accuracy and adaptability in experiments.
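
To ground the persistence/emergence idea, the toy filter below tracks a single feature's presence probability: between observations the probability decays toward disappearance (or recovers via emergence), and each detection or miss triggers a Bayes update under a simple detector model. All rates and probabilities are illustrative assumptions, multiple transitions within an interval are ignored, and Perpetua's full formulation chains mixtures of such filters over multiple hypotheses.

```python
# Toy single-feature persistence/emergence filter with Bayes updates.
# Rates and detector parameters are illustrative assumptions.
import math

def propagate(p_present, dt, lam_vanish=0.1, lam_emerge=0.02):
    """Advance presence probability by dt, ignoring multiple transitions."""
    stay = math.exp(-lam_vanish * dt)           # feature still present
    appear = 1.0 - math.exp(-lam_emerge * dt)   # absent feature re-emerged
    return p_present * stay + (1.0 - p_present) * appear

def update(p_present, detected, p_det=0.9, p_false=0.05):
    """Bayes update given a detection (or a miss)."""
    like_present = p_det if detected else 1.0 - p_det
    like_absent = p_false if detected else 1.0 - p_false
    num = like_present * p_present
    return num / (num + like_absent * (1.0 - p_present))

p = 0.5
for dt, det in [(1.0, True), (5.0, False), (5.0, False), (2.0, True)]:
    p = update(propagate(p, dt), det)
    print(f"P(present) = {p:.3f}")
```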

Authors:Yiguo He, Xinjun Cheng, Junjie Zhu, Chunping Qiu, Jun Wang, Xichuan Zhang, Qiangjuan Huang, Ke Yang
Title: SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks
Abstract:
Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-Text dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.
中文: 本文提出了SAR-TEXT,一个包含超过13万对SAR图像-文本的大规模高质量数据集,通过SAR-Narrator框架生成,显著提升了遥感任务中的检索、描述和视觉问答性能。
English: This paper introduces SAR-TEXT, a large-scale dataset of over 130,000 SAR image-text pairs generated via the SAR-Narrator framework, which significantly enhances performance in remote sensing tasks like retrieval, captioning, and VQA.

Authors:Siyu Mu, Wei Xuan Chan, Choon Hwai Yap
Title: HeartUnloadNet: A Weakly-Supervised Cycle-Consistent Graph Network for Predicting Unloaded Cardiac Geometry from Diastolic States
Abstract:
The unloaded cardiac geometry (i.e., the state of the heart devoid of luminal pressure) serves as a valuable zero-stress and zero-strain reference and is critical for personalized biomechanical modeling of cardiac function, to understand both healthy and diseased physiology and to predict the effects of cardiac interventions. However, estimating the unloaded geometry from clinical images remains a challenging task. Traditional approaches rely on inverse finite element (FE) solvers that require iterative optimization and are computationally expensive. In this work, we introduce HeartUnloadNet, a deep learning framework that predicts the unloaded left ventricular (LV) shape directly from the end diastolic (ED) mesh while explicitly incorporating biophysical priors. The network accepts a mesh of arbitrary size along with physiological parameters such as ED pressure, myocardial stiffness scale, and fiber helix orientation, and outputs the corresponding unloaded mesh. It adopts a graph attention architecture and employs a cycle-consistency strategy to enable bidirectional (loading and unloading) prediction, allowing for partial self-supervision that improves accuracy and reduces the need for large training datasets. Trained and tested on 20,700 FE simulations across diverse LV geometries and physiological conditions, HeartUnloadNet achieves sub-millimeter accuracy, with an average DSC of 0.986 and HD of 0.083 cm, while reducing inference time to just 0.02 seconds per case, over 10^5 times faster and significantly more accurate than traditional inverse FE solvers. Ablation studies confirm the effectiveness of the architecture. Notably, the cycle-consistent design enables the model to maintain a DSC of 97% even with as few as 200 training samples. This work thus presents a scalable and accurate surrogate for inverse FE solvers, supporting real-time clinical applications in the future.
中文: HeartUnloadNet是一种深度学习框架,通过结合生物物理先验和循环一致性策略,能直接从临床图像中精确快速地预测左心室无负载几何形状,其精度达亚毫米级,计算效率比传统方法提高超过10万倍。
English: HeartUnloadNet is a deep learning framework that accurately and rapidly predicts the unloaded left ventricular geometry from clinical images using biophysical priors and cycle-consistency, achieving sub-millimeter precision and reducing computation time by over 100,000 times compared to traditional methods.
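The cycle-consistency term can be sketched in a few lines; the tiny MLPs and vertex counts below are placeholders, not the paper's graph-attention architecture:

```python
import torch
import torch.nn as nn

# An unloading model maps the ED mesh to the unloaded mesh, a loading model
# maps it back, and the round trip is penalised: partial self-supervision.
unload = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
load = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

ed_verts = torch.randn(5000, 3)           # end-diastolic mesh vertices
gt_unloaded = torch.randn(5000, 3)        # FE ground truth (training only)

pred_unloaded = unload(ed_verts)
recon_ed = load(pred_unloaded)            # re-load the predicted geometry

loss = nn.functional.mse_loss(pred_unloaded, gt_unloaded) \
     + 0.5 * nn.functional.mse_loss(recon_ed, ed_verts)   # cycle term
loss.backward()
```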

Authors:James Dickens, Kamyar Hamad
Title: Part Segmentation of Human Meshes via Multi-View Human Parsing
Abstract:
Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground-truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo-ground-truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced: a windowed iterative farthest point sampling (FPS) with space-filling-curve-based serialization to effectively downsample the point clouds. This is followed by purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach. Project code and pre-processed data are available at https://github.com/JamesMcCullochDickens/Human3DParsing/tree/master.
中文摘要:本研究通过开发伪标注生成流程和结合内存高效采样策略与PointTransformer的方法,实现了仅利用几何数据对人体网格进行语义分割,从而连接了点云深度学习与人体解析两个领域。
English Summary: This study bridges point cloud deep learning and human parsing by enabling semantic segmentation of human meshes using only geometric data, achieved through a pseudo-ground truth pipeline and a memory-efficient sampling strategy combined with PointTransformer.
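A sketch of the sampling idea appears below; the Morton-order serialization, window size, and keep ratio are assumed stand-ins, as the paper's exact serialization and window scheme may differ. Sorting by a space-filling curve keeps each window spatially coherent, so FPS only runs on small chunks instead of the whole cloud:

```python
import numpy as np

def morton_code(pts, bits=10):
    # quantize coordinates and interleave their bits (simple Morton order)
    rng = pts.max(0) - pts.min(0) + 1e-9
    q = ((pts - pts.min(0)) / rng * (2**bits - 1)).astype(np.int64)
    code = np.zeros(len(pts), dtype=np.int64)
    for b in range(bits):
        for d in range(3):
            code |= ((q[:, d] >> b) & 1) << (3 * b + d)
    return code

def fps(pts, k):
    # greedy farthest point sampling within one window
    idx = [0]
    dist = np.linalg.norm(pts - pts[0], axis=1)
    for _ in range(k - 1):
        idx.append(int(dist.argmax()))
        dist = np.minimum(dist, np.linalg.norm(pts - pts[idx[-1]], axis=1))
    return np.array(idx)

def windowed_fps(pts, window=1024, ratio=0.25):
    order = np.argsort(morton_code(pts))      # serialize along the curve
    keep = []
    for s in range(0, len(pts), window):
        w = order[s:s + window]
        keep.append(w[fps(pts[w], max(1, int(len(w) * ratio)))])
    return np.concatenate(keep)

sampled = windowed_fps(np.random.rand(5000, 3))
```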

Authors:Haiyang Liu, Xiaolin Hong, Xuancheng Yang, Yudi Ruan, Xiang Lian, Michael Lingelbach, Hongwei Yi, Wei Li
Title: Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching
Abstract:
We present Livatar, a real-time audio-driven talking head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow matching based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/ with examples at https://h-liu1997.github.io/Livatar-1/

Authors:Grace Su, Sheng-Yu Wang, Aaron Hertzmann, Eli Shechtman, Jun-Yan Zhu, Richard Zhang
Title: Identifying Prompted Artist Names from Generated Images
Abstract:
A common and controversial use of text-to-image models is to generate pictures by explicitly naming artists, such as "in the style of Greg Rutkowski". We introduce a benchmark for prompted-artist recognition: predicting which artist names were invoked in the prompt from the image alone. The dataset contains 1.95M images covering 110 artists and spans four generalization settings: held-out artists, increasing prompt complexity, multiple-artist prompts, and different text-to-image models. We evaluate feature similarity baselines, contrastive style descriptors, data attribution methods, supervised classifiers, and few-shot prototypical networks. Generalization patterns vary: supervised and few-shot models excel on seen artists and complex prompts, whereas style descriptors transfer better when the artist's style is pronounced; multi-artist prompts remain the most challenging. Our benchmark reveals substantial headroom and provides a public testbed to advance the responsible moderation of text-to-image models. We release the dataset and benchmark to foster further research: https://graceduansu.github.io/IdentifyingPromptedArtists/
Chinese Summary: 本研究提出了一个从文本到图像模型生成的作品中识别艺术家风格的基准,通过评估多种方法在不同泛化场景下的表现,揭示了负责任AI内容审核方面仍有巨大提升空间。
English Summary: This study introduces a benchmark for recognizing artists from images generated by text-to-image models, evaluating various methods across different generalization settings and revealing significant room for improvement in responsible AI moderation.

Authors:Xinyu Wang, Jinghua Hou, Zhe Liu, Yingying Zhu
Title: HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation
Abstract:
Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point clouds. While recent Mamba-based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. Effectively combining these complementary strengths, however, remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long-range dependencies and fine-grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state-of-the-art performance on the ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at https://github.com/deepinact/HybridTM.
中文: HybridTM是一种创新的混合架构,将Transformer和Mamba相结合用于3D语义分割,有效整合两者的互补优势以捕捉长程依赖关系和细粒度局部特征,在多个基准测试中实现了最先进的性能。
English: HybridTM is a novel hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation, effectively combining their complementary strengths to capture long-range dependencies and fine-grained local features, achieving state-of-the-art performance on multiple benchmarks.

Authors:Baoyao Yang, Wanyun Li, Dixin Chen, Junxiang Chen, Wenbin Yao, Haifeng Lin
Title: VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding
Abstract:
This paper introduces VideoMind, a video-centric omni-modal dataset designed for deep video content cognition and enhanced multi-modal feature representation. The dataset comprises 103K video samples (3K reserved for testing), each paired with audio and systematically detailed textual descriptions. Specifically, every video and its audio is described across three hierarchical layers (factual, abstract, and intent), progressing from surface to depth. It contains over 22 million words, averaging ~225 words per sample. VideoMind's key distinction from existing datasets is its provision of intent expressions, which require contextual integration across the entire video and are not directly observable. These deep-cognitive expressions are generated using a Chain-of-Thought (COT) approach, prompting the mLLM through step-by-step reasoning. Each description includes annotations for subject, place, time, event, action, and intent, supporting downstream recognition tasks. Crucially, we establish a gold-standard benchmark with 3,000 manually validated samples for evaluating deep-cognitive video understanding. We design hybrid-cognitive retrieval experiments, scored by multi-level retrieval metrics, to appropriately assess deep video comprehension. Evaluation results for models (e.g., InternVideo, VAST, UMT-L) are released. VideoMind serves as a powerful benchmark for fine-grained cross-modal alignment and advances fields requiring in-depth video understanding, such as emotion and intent recognition. The data is publicly available on GitHub, HuggingFace, and OpenDataLab, https://github.com/cdx-cindy/VideoMind.
中文: 本文介绍了VideoMind视频数据集,该数据集通过链式思维方法生成包含意图表达的多层次文本描述,为深度视频理解和跨模态对齐任务提供了重要基准。
English: This paper presents VideoMind, a comprehensive video dataset with audio and multi-layered textual descriptions that uniquely includes intent expressions generated via Chain-of-Thought reasoning, serving as a benchmark for deep video understanding and cross-modal alignment tasks.

Authors:Daniil Morozov, Reuben Dorent, Nazim Haouchine
Title: A 3D Cross-modal Keypoint Descriptor for MR-US Matching and Registration
Abstract:
Intraoperative registration of real-time ultrasound (iUS) to preoperative Magnetic Resonance Imaging (MRI) remains an unsolved problem due to severe modality-specific differences in appearance, resolution, and field-of-view. To address this, we propose a novel 3D cross-modal keypoint descriptor for MRI-iUS matching and registration. Our approach employs patient-specific matching-by-synthesis, generating synthetic iUS volumes from preoperative MRI. This enables supervised contrastive training to learn a shared descriptor space. A probabilistic keypoint detection strategy is then employed to identify anatomically salient and modality-consistent locations. During training, a curriculum-based triplet loss with dynamic hard negative mining is used to learn descriptors that are i) robust to iUS artifacts such as speckle noise and limited coverage, and ii) rotation-invariant. At inference, the method detects keypoints in MR and real iUS images and identifies sparse matches, which are then used to perform rigid registration. Our approach is evaluated using 3D MRI-iUS pairs from the ReMIND dataset. Experiments show that our approach outperforms state-of-the-art keypoint matching methods across 11 patients, with an average precision of $69.8\%$. For image registration, our method achieves a competitive mean Target Registration Error of 2.39 mm on the ReMIND2Reg benchmark. Compared to existing iUS-MR registration approaches, our framework is interpretable, requires no manual initialization, and shows robustness to iUS field-of-view variation. Code is available at https://github.com/morozovdd/CrossKEY.
中文: 本研究提出一种新型三维跨模态关键点描述符用于MRI-超声配准,通过合成超声生成和对比学习实现无需人工初始化的鲁棒配准,在ReMIND数据集上以69.8%的匹配精度和2.39毫米配准误差超越现有方法。
English: This study introduces a novel 3D cross-modal keypoint descriptor for MRI-ultrasound registration, using synthetic ultrasound generation and contrastive learning to achieve robust, interpretable alignment without manual initialization, outperforming existing methods with 69.8% precision and 2.39 mm registration error.
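A sketch of a curriculum-style triplet loss with in-batch hard negative mining follows; the margin schedule and batch-level mining are simplified assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

# For each anchor MR descriptor, the negative is the closest non-matching
# iUS descriptor in the batch; the margin grows over training epochs so the
# curriculum hardens gradually.
def triplet_hard_negative(mr, us, epoch, base_margin=0.2, growth=0.01):
    mr = F.normalize(mr, dim=1)                 # (B, D) MR keypoint descriptors
    us = F.normalize(us, dim=1)                 # (B, D) matching iUS descriptors
    dist = torch.cdist(mr, us)                  # pairwise descriptor distances
    pos = dist.diagonal()                       # matched pairs on the diagonal
    neg = (dist + torch.eye(len(mr)) * 1e6).min(dim=1).values  # hardest negatives
    margin = base_margin + growth * epoch       # assumed curriculum schedule
    return F.relu(pos - neg + margin).mean()

loss = triplet_hard_negative(torch.randn(32, 128), torch.randn(32, 128), epoch=5)
```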

Authors:Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
Title: TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Abstract:
Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path-searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. In addition, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard to evaluate, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection of samples with higher potential. (ii) At fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
Chinese: TTS-VAR提出了一种视觉自回归模型的测试时扩展框架,通过自适应批次调度、基于聚类的多样性搜索保留结构信息,以及基于重采样的潜力选择来提升生成质量,使GenEval得分显著提高了8.7%。
English: TTS-VAR introduces a test-time scaling framework for visual auto-regressive models that employs adaptive batch scheduling, clustering-based diversity search for structural preservation, and resampling-based potential selection to enhance generation quality, achieving an 8.7% improvement in GenEval scores.
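The fine-scale resampling step reduces to weighted resampling of candidates by their potential scores; the sketch below uses placeholder shapes and a softmax weighting, which is one reasonable reading rather than the paper's exact scoring:

```python
import torch

# Candidates with higher potential are duplicated, low-potential ones are
# dropped, keeping the batch size fixed across generation scales.
def resample_by_potential(candidates, potentials, temperature=1.0):
    # candidates: (B, ...) partial generations; potentials: (B,) reward scores
    probs = torch.softmax(potentials / temperature, dim=0)
    idx = torch.multinomial(probs, num_samples=len(candidates), replacement=True)
    return candidates[idx], idx

cands = torch.randn(8, 16, 16, 32)            # hypothetical token maps
scores = torch.randn(8)                       # potential scores from a reward fn
cands, picked = resample_by_potential(cands, scores)
```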

Authors:Tianheng Qiu, Jingchun Gao, Jingyu Li, Huiyi Leong, Xuan Huang, Xi Wang, Xiaocheng Zhang, Kele Xu, Lan Zhang
Title: IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning
Abstract:
Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs demonstrate proficiency in spatial and temporal understanding separately, they cannot perform fine-grained spatial control over time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. Towards this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both the prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LLM to model the implicit relationship between prompts that characterize user intent and video sequences. We then propose a parameter-efficient box adapter that augments the object semantic information in the global visual context so that visual tokens have a priori information about the user intent. Experiments show that the combination of the two strategies further enhances the LVLM's ability to model spatial details in video sequences and facilitates accurate generation of controlled intent-oriented captions. Our proposed method achieved state-of-the-art results on several open-source LVLMs and was the runner-up in the IntentVC challenge. Our code is available at https://github.com/thqiu0419/IntentVCNet.
中文: 提出的IntentVCNet通过结合提示策略和框适配器,弥合了大型视觉语言模型中的时空差距,实现了细粒度的意图导向视频描述,并取得了最先进的成果。
English: The proposed IntentVCNet bridges the spatio-temporal gap in large visual language models by combining prompt strategies and a box adapter, enabling fine-grained intent-oriented video captioning and achieving state-of-the-art results.

Authors:Clément Cornet, Romaric Besançon, Hervé Le Borgne
Title: Explaining How Visual, Textual and Multimodal Encoders Share Concepts
Abstract:
Sparse autoencoders (SAEs) have emerged as a powerful technique for extracting human-interpretable features from neural network activations. Previous works compared different models based on SAE-derived features, but those comparisons have been restricted to models within the same modality. We propose a novel indicator allowing quantitative comparison of models across SAE features, and use it to conduct a comparative study of visual, textual and multimodal encoders. We also propose to quantify the Comparative Sharedness of individual features between different classes of models. With these two new tools, we conduct several studies on 21 encoders of the three types, with two significantly different sizes, and considering generalist and domain-specific datasets. The results allow us to revisit previous studies in the light of encoders trained in a multimodal context and to quantify the extent to which all these models share representations or features. They also suggest that the visual features specific to VLMs among vision encoders are shared with text encoders, highlighting the impact of text pretraining. The code is available at https://github.com/CEA-LIST/SAEshareConcepts
中文摘要:稀疏自编码器通过新型量化指标和特征共享度评估,实现了跨模态模型的比较研究,发现多模态视觉模型中的特定视觉特征与文本编码器共享,体现了文本预训练的影响。
English Summary: Sparse autoencoders enable cross-modal model comparisons through a novel quantitative indicator and comparative sharedness metric, revealing that visual features in multimodal models overlap with text encoders due to pretraining influences.
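For readers unfamiliar with the building block, a minimal sparse autoencoder of the kind used to extract interpretable features is shown below; the dimensions and L1 coefficient are generic choices, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Overcomplete dictionary, ReLU codes, L1 sparsity penalty: the standard
# SAE recipe for decomposing encoder activations into sparse features.
class SAE(nn.Module):
    def __init__(self, d_act=768, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)
        self.dec = nn.Linear(d_dict, d_act, bias=False)

    def forward(self, x):
        f = torch.relu(self.enc(x))           # sparse feature activations
        return self.dec(f), f

sae = SAE()
acts = torch.randn(256, 768)                  # activations from some encoder
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
```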

Authors:Zongzheng Zhang, Xuchong Qiu, Boran Zhang, Guantian Zheng, Xunjiang Gu, Guoxuan Chi, Huan-ang Gao, Leichen Wang, Ziming Liu, Xinrun Li, Igor Gilitschenski, Hongyang Li, Hang Zhao, Hao Zhao
Title: Delving into Mapping Uncertainty for Mapless Trajectory Prediction
Abstract:
Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent's kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle's future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at https://github.com/Ethan-Zheng136/Map-Uncertainty-for-Trajectory-Prediction.
Chinese: 本研究提出了一种本体感知场景门控机制,通过基于自车运动状态预测自适应融合地图不确定性到轨迹预测中,在无地图自动驾驶系统中实现了高达23.6%的性能提升。
English: This study introduces a proprioceptive scenario gating mechanism that adaptively integrates map uncertainty into trajectory prediction based on ego-vehicle kinematic forecasts, achieving up to 23.6% performance improvement in mapless autonomous driving systems.

Authors:Frauke Wilm, Luis Carlos Rivera Monroy, Mathias Öttl, Lukas Mürdter, Leonid Mill, Andreas Maier
Title: A COCO-Formatted Instance-Level Dataset for Plasmodium Falciparum Detection in Giemsa-Stained Blood Smears
Abstract:
Accurate detection of Plasmodium falciparum in Giemsa-stained blood smears is an essential component of reliable malaria diagnosis, especially in developing countries. Deep learning-based object detection methods have demonstrated strong potential for automated malaria diagnosis, but their adoption is limited by the scarcity of datasets with detailed instance-level annotations. In this work, we present an enhanced version of the publicly available NIH malaria dataset, with detailed bounding box annotations in COCO format to support object detection training. We validated the revised annotations by training a Faster R-CNN model to detect infected and non-infected red blood cells, as well as white blood cells. Cross-validation on the original dataset yielded F1 scores of up to 0.88 for infected cell detection. These results underscore the importance of annotation volume and consistency, and demonstrate that automated annotation refinement combined with targeted manual correction can produce training data of sufficient quality for robust detection performance. The updated annotation set is publicly available via GitHub: https://github.com/MIRA-Vision-Microscopy/malaria-thin-smear-coco.
中文: 本研究通过完善疟疾数据集的精细化标注,并采用Faster R-CNN模型验证,证明提升标注质量可实现可靠的感染细胞自动检测,最高F1分数达0.88。
English: This study enhances a malaria dataset with detailed object detection annotations and demonstrates through Faster R-CNN validation that improved annotation quality enables robust automated detection of infected cells, achieving an F1 score up to 0.88.
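For reference, a COCO-format detection record for one annotated smear image has the shape below; the file names, values, and category ids are made up for illustration and need not match the dataset's actual assignment:

```python
# Standard COCO layout: images, categories, and bbox annotations in
# [x, y, width, height] pixel coordinates.
coco = {
    "images": [{"id": 1, "file_name": "smear_0001.png",
                "width": 1600, "height": 1200}],
    "categories": [
        {"id": 1, "name": "infected_rbc"},
        {"id": 2, "name": "uninfected_rbc"},
        {"id": 3, "name": "wbc"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [412.0, 305.0, 58.0, 61.0],      # [x, y, width, height]
         "area": 3538.0, "iscrowd": 0},
    ],
}
```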

Authors:Francesco Dalmonte, Emirhan Bayar, Emre Akbas, Mariana-Iuliana Georgescu
Title: Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection
Abstract:
Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated datasets. In this work, we tackle unsupervised medical anomaly detection by proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose using the Q-Former architecture as the bottleneck, which enables control of the length of the reconstruction sequence while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at https://github.com/emirhanbayar/QFAE.
中文: 本研究提出Q-Former自编码器,这是一种无监督医学异常检测框架,利用预训练的冻结视觉模型提取丰富特征,无需领域特定微调即在多个基准测试中取得最优性能。
English: This study introduces the Q-Former Autoencoder, an unsupervised medical anomaly detection framework that leverages frozen pretrained vision models to extract rich features and achieves state-of-the-art results on multiple benchmarks without domain-specific fine-tuning.
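The bottleneck idea can be sketched with one cross-attention layer; the query count, dimensions, and single-layer design below are simplifications of the full Q-Former:

```python
import torch
import torch.nn as nn

# A fixed number of learnable queries attend to frozen foundation-model
# features, so the reconstruction sequence length is set by the query count
# rather than by the input token count.
class QFormerBottleneck(nn.Module):
    def __init__(self, dim=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):                   # feats: (B, N, dim), frozen
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)     # queries aggregate features
        return out                              # (B, n_queries, dim)

tokens = torch.randn(4, 197, 768)               # e.g. frozen DINOv2 tokens
latent = QFormerBottleneck()(tokens)
```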

Authors:Haoran Xu, Saining Zhang, Peishuo Li, Baijun Ye, Xiaoxue Chen, Huan-ang Gao, Jv Zheng, Xiaowei Song, Ziqiao Peng, Run Miao, Jinrang Jia, Yifeng Shi, Guangqi Yi, Hang Zhao, Hao Tang, Hongyang Li, Kaicheng Yu, Hao Zhao
Title: CRUISE: Cooperative Reconstruction and Editing in V2X Scenarios using Gaussian Splatting
Abstract:
Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction-and-synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while supporting flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure views, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: 1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; 2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking on the V2X-Seq benchmark; and 3) CRUISE effectively generates challenging corner cases.
中文: 本文提出CRUISE框架,通过分解式高斯溅射重建和编辑车联网驾驶场景,实现高保真数据增强,有效提升自动驾驶在检测与跟踪任务中的性能表现。
English: This paper introduces CRUISE, a reconstruction-and-synthesis framework that uses decomposed Gaussian Splatting to reconstruct and edit V2X driving scenes, enabling high-fidelity data augmentation and improving autonomous driving performance in detection and tracking tasks.

Authors:Simin Huo, Ning Li
Title: Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows
Abstract:
We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.
中文摘要:Iwin Transformer 是一种无需位置嵌入的分层视觉模型,通过交错窗口注意力与深度卷积的协同设计,在单一模块中实现全局信息交互,在图像分类、语义分割和视频识别等任务中展现出卓越性能。
English Summary: The Iwin Transformer is a hierarchical vision model that integrates interleaved window attention and depthwise convolution to achieve global information exchange in a single module, demonstrating strong performance across image classification, segmentation, and video tasks.
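One plausible reading of the interleaving, shown in 1D for clarity, is that every k-th token joins the same window, so a single attention step already mixes distant positions while the depthwise convolution links the skipped neighbours; this partition scheme is an assumption for illustration, not the paper's exact layout:

```python
import torch

# Instead of contiguous windows as in Swin, tokens i, i+n, i+2n, ... share
# window i, connecting distant tokens within one attention call.
def interleave_partition(x, n_windows):
    # x: (B, L, C) with L divisible by n_windows
    B, L, C = x.shape
    w = L // n_windows
    return x.view(B, w, n_windows, C).transpose(1, 2).reshape(B * n_windows, w, C)

x = torch.arange(8).float().view(1, 8, 1)
print(interleave_partition(x, 2).squeeze(-1))   # windows [0,2,4,6] and [1,3,5,7]
```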

Authors:Yilong Hu, Shijie Chang, Lihe Zhang, Feng Tian, Weibing Sun, Huchuan Lu
Title: UniSegDiff: Boosting Unified Lesion Segmentation via a Staged Diffusion Model
Abstract:
The Diffusion Probabilistic Model (DPM) has demonstrated remarkable performance across a variety of generative tasks. The inherent randomness in diffusion models helps address issues such as blurring at the edges of medical images and labels, positioning DPMs as a promising approach for lesion segmentation. However, we find that the current training and inference strategies of diffusion models result in an uneven distribution of attention across different timesteps, leading to longer training times and suboptimal solutions. To this end, we propose UniSegDiff, a novel diffusion model framework designed to address lesion segmentation in a unified manner across multiple modalities and organs. This framework introduces a staged training and inference approach, dynamically adjusting the prediction targets at different stages to force the model to maintain high attention across all timesteps, and achieves unified lesion segmentation through pre-training the feature extraction network for segmentation. We evaluate performance on six different organs across various imaging modalities. Comprehensive experimental results demonstrate that UniSegDiff significantly outperforms previous state-of-the-art (SOTA) approaches. The code is available at https://github.com/HUYILONG-Z/UniSegDiff.
Chinese: 提出的UniSegDiff框架通过引入分阶段训练和推理方法,在所有时间步保持一致的注意力,显著提升了多模态和多器官的病灶分割性能,明显优于以往的最先进方法。
English: The proposed UniSegDiff framework enhances lesion segmentation across multiple modalities and organs by introducing a staged training and inference approach that maintains consistent attention across all timesteps, significantly outperforming previous state-of-the-art methods.

Authors:Runmin Zhang, Zhu Yu, Si-Yuan Cao, Lingyu Zhu, Guangyi Zhang, Xiaokai Bai, Hui-Liang Shen
Title: Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
Abstract:
This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within adaptive regions in each image and dynamically adjust the contributions from different views, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with high occupancy probabilities for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet, ScanNet200 and ARKitScenes datasets. The source code is available at https://github.com/RM-Zhang/SGCDet.
中文: SGCDet提出了一种自适应三维体积构建框架,通过几何感知聚合和稀疏体素选择,仅需三维边界框监督即可实现最先进的室内三维物体检测。
English: SGCDet introduces an adaptive 3D volume construction framework with geometry-aware aggregation and sparse voxel selection, achieving state-of-the-art indoor 3D object detection using only 3D bounding box supervision.

Authors:Jiangjun Peng, Yisi Luo, Xiangyong Cao, Shuang Xu, Deyu Meng
Title: Beyond Low-rankness: Guaranteed Matrix Recovery via Modified Nuclear Norm
Abstract:
The nuclear norm (NN) has been widely explored in matrix recovery problems, such as Robust PCA and matrix completion, leveraging the inherent global low-rank structure of the data. In this study, we introduce a new modified nuclear norm (MNN) framework, where the MNN family norms are defined by adopting suitable transformations and performing the NN on the transformed matrix. The MNN framework offers two main advantages: (1) it jointly captures both local information and global low-rankness without requiring trade-off parameter tuning; (2) under mild assumptions on the transformation, we provide exact theoretical recovery guarantees for both Robust PCA and matrix completion (MC) tasks, an achievement not shared by existing methods that combine local and global information. Thanks to its general and flexible design, MNN can accommodate various proven transformations, enabling a unified and effective approach to structured low-rank recovery. Extensive experiments demonstrate the effectiveness of our method. Code and supplementary material are available at https://github.com/andrew-pengjj/modified_nuclear_norm.
中文: 本研究提出了一种改进的核范数框架,无需参数调优即可同时捕捉局部信息和全局低秩结构,为鲁棒主成分分析和矩阵补全任务提供了理论保证,并通过实验验证了其有效性。
English: This study introduces a modified nuclear norm framework that captures both local and global low-rank structures without parameter tuning, providing theoretical guarantees and demonstrating effectiveness across various transformations and applications.
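The definition is easy to state in code: apply a transformation T, then take the ordinary nuclear norm of the transformed matrix. The finite-difference T below is one illustrative local transform; the framework admits many proven choices:

```python
import numpy as np

def nuclear_norm(M):
    # sum of singular values
    return np.linalg.svd(M, compute_uv=False).sum()

def grad_transform(X):
    # horizontal differences capture local structure (one possible T)
    return X[:, 1:] - X[:, :-1]

X = np.random.rand(100, 100)
print("NN :", nuclear_norm(X))                 # plain nuclear norm
print("MNN:", nuclear_norm(grad_transform(X))) # modified nuclear norm
```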

Authors:Minje Park, Jeonghwa Lim, Taehyung Yu, Sunghoon Joo
Title: SemiSegECG: A Multi-Dataset Benchmark for Semi-Supervised Semantic Segmentation in ECG Delineation
Abstract:
Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi-supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present SemiSegECG, the first systematic benchmark for semi-supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in-domain and cross-domain. Additionally, we propose ECG-specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi-supervised ECG delineation. We anticipate that SemiSegECG will serve as a foundation for advancing semi-supervised ECG delineation methods and will facilitate further research in this domain.
中文:SemiSegECG首次建立了心电描记半监督语义分割的系统基准,通过整合多源数据和标准化评估框架,证明了基于Transformer的模型在半监督心电波形分割中优于卷积网络。
English: SemiSegECG introduces the first systematic benchmark for semi-supervised semantic segmentation in ECG delineation, demonstrating transformer-based models' superiority over convolutional networks while providing unified datasets and evaluation frameworks.

Authors:Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
Title: LMM-Det: Make Large Multimodal Models Excel in Object Detection
Abstract:
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model meets object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and codes are available at https://github.com/360CVGroup/LMM-Det.
中文: 本文提出LMM-Det方法,通过数据分布调整和推理优化提升召回率,使大型多模态模型无需专用检测模块即可实现物体检测功能。
English: The paper introduces LMM-Det, a novel approach that enables large multimodal models to perform object detection without specialized modules by improving recall through data distribution adjustments and inference optimization.

Authors:Chenyu Su, Weiwei Shang, Chen Qian, Fei Zhang, Shuang Cong
Title: ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation
Abstract:
Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos are available at https://github.com/scy-v/ReSem3D and https://resem3d.github.io.
中文: ReSem3D框架通过多模态AI模型的协同作用,从自然语言指令构建精细化的3D空间约束,实现在多样化环境中的实时自适应机器人操作。
English: ReSem3D is a robotic manipulation framework that leverages multimodal AI models to create fine-grained 3D spatial constraints from natural language, enabling real-time adaptive task execution in diverse environments.

Authors:Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, Hang Zhao
Title: LONG3R: Long Sequence Streaming 3D Reconstruction
Abstract:
Recent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model's performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed. Project page: https://zgchen33.github.io/LONG3R/.

Authors:Chengchang Tian, Jianwei Ma, Yan Huang, Zhanye Chen, Honghao Wei, Hui Zhang, Wei Hong
Title: DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception
Abstract:
Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors. The code will be released at https://github.com/ChengchangTian/DATA.
Chinese: 领域与时间对齐(DATA)网络通过专用模块解决协作感知中的领域差异和时间错位问题,在保持对通信延迟和位姿误差鲁棒性的同时,实现了最先进的性能表现。
English: The Domain-And-Time Alignment (DATA) network addresses domain gaps and temporal misalignments in collaborative perception through specialized modules, achieving state-of-the-art performance with robust handling of communication delays and pose errors.

Authors:Minghao Fu, Guo-Hua Wang, Xiaohao Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Title: TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
Abstract:
Recent advances in text-to-image synthesis largely benefit from sophisticated sampling strategies and classifier-free guidance (CFG) to ensure high-quality generation. However, CFG's reliance on two forward passes, especially when combined with intricate sampling algorithms, results in prohibitively high inference costs. To address this, we introduce TeEFusion (Text Embeddings Fusion), a novel and efficient distillation method that directly incorporates the guidance magnitude into the text embeddings and distills the teacher model's complex sampling strategy. By simply fusing conditional and unconditional text embeddings using linear operations, TeEFusion reconstructs the desired guidance without adding extra parameters, simultaneously enabling the student model to learn from the teacher's output produced via its sophisticated sampling approach. Extensive experiments on state-of-the-art models such as SD3 demonstrate that our method allows the student to closely mimic the teacher's performance with a far simpler and more efficient sampling strategy. Consequently, the student model achieves inference speeds up to 6$\times$ faster than the teacher model, while maintaining image quality at levels comparable to those obtained through the teacher's complex sampling approach. The code is publicly available at https://github.com/AIDC-AI/TeEFusion.
中文摘要:TeEFusion是一种高效的蒸馏方法,通过将引导融入文本嵌入并提炼复杂采样策略,使学生模型在保持与教师模型相当图像质量的同时,推理速度提升高达6倍。
English Summary: TeEFusion is an efficient distillation method that integrates guidance into text embeddings and distills complex sampling strategies, enabling student models to achieve up to 6× faster inference while maintaining image quality comparable to teacher models.
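The linear fusion the abstract describes can be stated in a few lines; the embedding shape below is a hypothetical placeholder:

```python
import torch

# The CFG update normally applied to noise predictions is folded into the
# text embeddings instead, so the student runs a single forward pass.
def fuse_embeddings(e_cond, e_uncond, guidance_scale):
    # same linear form as classifier-free guidance, but on embeddings
    return e_uncond + guidance_scale * (e_cond - e_uncond)

e_cond = torch.randn(1, 77, 4096)       # hypothetical prompt embedding shape
e_uncond = torch.randn(1, 77, 4096)     # empty-prompt embedding
e_fused = fuse_embeddings(e_cond, e_uncond, guidance_scale=5.0)
```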

Authors:SeungJun Moon, Hah Min Lew, Seungeun Lee, Ji-Su Kang, Gyeong-Moon Park
Title: GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
Abstract:
Despite recent progress in 3D head avatar generation, balancing identity preservation, i.e., reconstruction, with novel poses and expressions, i.e., animation, remains a challenge. Existing methods struggle to adapt Gaussians to varying geometrical deviations across facial regions, resulting in suboptimal quality. To address this, we propose GeoAvatar, a framework for adaptive geometrical Gaussian Splatting. GeoAvatar leverages Adaptive Pre-allocation Stage (APS), an unsupervised method that segments Gaussians into rigid and flexible sets for adaptive offset regularization. Then, based on mouth anatomy and dynamics, we introduce a novel mouth structure and the part-wise deformation strategy to enhance the animation fidelity of the mouth. Finally, we propose a regularization loss for precise rigging between Gaussians and 3DMM faces. Moreover, we release DynamicFace, a video dataset with highly expressive facial motions. Extensive experiments show the superiority of GeoAvatar compared to state-of-the-art methods in reconstruction and novel animation scenarios.

Authors:Jinhong He, Minglong Xue, Zhipu Liu, Mingliang Zhou, Aoxiang Ning, Palaiahnakote Shivakumara
Title: Degradation-Consistent Learning via Bidirectional Diffusion for Low-Light Image Enhancement
Abstract:
Low-light image enhancement aims to improve the visibility of degraded images to better align with human visual perception. While diffusion-based methods have shown promising performance due to their strong generative capabilities, their unidirectional modelling of degradation often struggles to capture the complexity of real-world degradation patterns, leading to structural inconsistencies and pixel misalignments. To address these challenges, we propose a bidirectional diffusion optimization mechanism that jointly models the degradation processes of both low-light and normal-light images, enabling more precise degradation parameter matching and enhancing generation quality. Specifically, we perform bidirectional diffusion, from low-to-normal light and from normal-to-low light, during training, and introduce an adaptive feature interaction block (AFI) to refine feature representation. By leveraging the complementarity between these two paths, our approach imposes an implicit symmetry constraint on illumination attenuation and noise distribution, facilitating consistent degradation learning and improving the model's ability to perceive illumination and detail degradation. Additionally, we design a reflection-aware correction module (RACM) to guide color restoration post-denoising and suppress overexposed regions, ensuring content consistency and generating high-quality images that align with human visual perception. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art methods in both quantitative and qualitative evaluations while generalizing effectively to diverse degradation scenarios. Code at https://github.com/hejh8/BidDiff
中文摘要:本文提出了一种双向扩散优化方法,通过联合建模低光与正常光图像的退化过程,结合自适应特征交互和反射感知校正模块,有效提升图像增强的结构一致性与视觉质量。
English Summary: This paper introduces a bidirectional diffusion optimization method for low-light image enhancement that jointly models degradation processes in both low-light and normal-light images, incorporating adaptive feature interaction and reflection-aware correction to improve structural consistency and visual quality.

Authors:Binghua Li, Ziqing Chang, Tong Liang, Chao Li, Toshihisa Tanaka, Shigeki Aoki, Qibin Zhao, Zhe Sun
Title: Parameter-Efficient Fine-Tuning of 3D DDPM for MRI Image Generation Using Tensor Networks
Abstract:
We address the challenge of parameter-efficient fine-tuning (PEFT) for three-dimensional (3D) U-Net-based denoising diffusion probabilistic models (DDPMs) in magnetic resonance imaging (MRI) image generation. Despite its practical significance, research on parameter-efficient representations of 3D convolution operations remains limited. To bridge this gap, we propose Tensor Volumetric Operator (TenVOO), a novel PEFT method specifically designed for fine-tuning DDPMs with 3D convolutional backbones. Leveraging tensor network modeling, TenVOO represents 3D convolution kernels with lower-dimensional tensors, effectively capturing complex spatial dependencies during fine-tuning with few parameters. We evaluate TenVOO on three downstream brain MRI datasets (ADNI, PPMI, and BraTS2021) by fine-tuning a DDPM pretrained on 59,830 T1-weighted brain MRI scans from the UK Biobank. Our results demonstrate that TenVOO achieves state-of-the-art performance in multi-scale structural similarity index measure (MS-SSIM), outperforming existing approaches in capturing spatial dependencies while requiring only 0.3% of the trainable parameters of the original model. Our code is available at: https://github.com/xiaovhua/tenvoo
中文: 我们提出TenVOO方法,针对三维U-Net扩散模型的参数高效微调,通过张量网络以少量参数表示卷积核,在脑部MRI数据集上实现了最优性能。
English: We propose TenVOO, a parameter-efficient fine-tuning method for 3D U-Net-based DDPMs in MRI generation that uses tensor networks to represent convolution kernels with fewer parameters while achieving superior performance on brain MRI datasets.

Authors:Qianyi He, Yuan Chang Leong
Title: A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli
Abstract:
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus as well as narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with partial subject-specific decoder, which leverages common representational structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
Chinese: 针对Algonauts 2025挑战赛,本研究提出一种序列到序列的Transformer模型,通过预训练模型提取视听语言特征,采用双重交叉注意力机制整合刺激信息与大脑状态,在分布内外数据上均实现了优异的脑活动预测性能。
English: The Algonauts 2025 Challenge submission introduces a sequence-to-sequence Transformer that autoregressively predicts whole-brain fMRI responses by integrating multimodal inputs—visual, auditory, and language—using pretrained models and dual cross-attention mechanisms, achieving robust performance on both in-distribution and out-of-distribution data.

Authors:Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S. Miller, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao
Title: TextSAM-EUS: Text Prompt Learning for SAM to Accurately Segment Pancreatic Tumor in Endoscopic Ultrasound
Abstract:
Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM's architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation. Code is available at https://github.com/HealthX-Lab/TextSAM-EUS.
中文摘要:TextSAM-EUS是一种轻量级的文本驱动改进模型,无需手动几何提示即可实现内镜超声中胰腺肿瘤的自动分割,在仅调整极少量参数的情况下,其性能超越了现有最先进方法。
English Summary: TextSAM-EUS is a lightweight, text-adapted version of the Segment Anything Model that enables automatic pancreatic tumor segmentation in endoscopic ultrasound without manual prompts, achieving superior performance over existing methods while tuning only a minimal fraction of parameters.
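TextSAM-EUS tunes only a small fraction of parameters via LoRA; a generic LoRA linear layer in the standard form (not the paper's exact placement inside SAM) looks like this:

```python
import torch
import torch.nn as nn

# The frozen base weight gets a trainable low-rank update B @ A, scaled by
# alpha / rank; only A and B receive gradients.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(8, 256))
```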

Authors:Xiaoran Sun, Liyan Wang, Cong Wang, Yeying Jin, Kin-man Lam, Zhixun Su, Yang Yang, Jinshan Pan
Title: Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement
Abstract:
Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a novel framework that leverages large vision-language models (VLMs) with iterative and manual instructions (IMIs) for LLIE. VLM-IMI incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. To effectively integrate cross-modal priors, we introduce an instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, we adopt an iterative and manual instruction strategy to refine textual instructions, progressively improving visual quality. This refinement enhances structural fidelity, semantic alignment, and the recovery of fine details under extremely low-light conditions. Extensive experiments across diverse scenarios demonstrate that VLM-IMI outperforms state-of-the-art methods in both quantitative metrics and perceptual quality. The source code is available at https://github.com/sunxiaoran01/VLM-IMI.
English Summary: VLM-IMI introduces a novel low-light image enhancement framework that leverages vision-language models with iterative manual instructions, using textual descriptions of normal-light content to achieve semantically informed restoration that outperforms existing methods.

Authors:Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Title: GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
Abstract:
Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model's logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model's fluency and general capabilities.
中文: GrAInS提出了一种新颖的推理时引导方法,通过基于梯度的归因分析动态调整模型激活,无需重新训练即可在提升真实性、降低幻觉方面实现显著性能突破。
English: GrAInS introduces a novel inference-time steering method that uses gradient-based token attribution to dynamically adjust model activations, achieving significant performance improvements in truthfulness and reduced hallucinations without retraining.
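
A rough sketch of the steering step under stated assumptions: given hidden states and the indices of the top-k positively and negatively attributed tokens (which the method obtains via Integrated Gradients), form a directional vector and apply it with norm preservation. The function names and the strength scalar are illustrative.

import torch

def build_steering_vector(hidden, pos_idx, neg_idx):
    # hidden: (seq_len, d) hidden states at one layer; pos_idx / neg_idx are the
    # top-k token indices most positively / negatively attributed toward
    # preferred vs. dispreferred outputs.
    v = hidden[pos_idx].mean(dim=0) - hidden[neg_idx].mean(dim=0)
    return v / v.norm()

def steer(h, v, strength=4.0):
    # Shift activations along the steering direction, then rescale each token
    # so the representational norm is preserved.
    h_new = h + strength * v
    return h_new * (h.norm(dim=-1, keepdim=True) / h_new.norm(dim=-1, keepdim=True))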

Authors:Yuezun Li, Delong Zhu, Xinjie Cui, Siwei Lyu
Title: Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics
Abstract:
The rapid advancement of AI technologies has significantly increased the diversity of DeepFake videos circulating online, posing a pressing challenge for generalizable forensics, i.e., detecting a wide range of unseen DeepFake types using a single model. Addressing this challenge requires datasets that are not only large-scale but also rich in forgery diversity. However, most existing datasets, despite their scale, include only a limited variety of forgery types, making them insufficient for developing generalizable detection methods. Therefore, we build upon our earlier Celeb-DF dataset and introduce Celeb-DF++, a new large-scale and challenging video DeepFake benchmark dedicated to the generalizable forensics challenge. Celeb-DF++ covers three commonly encountered forgery scenarios: Face-swap (FS), Face-reenactment (FR), and Talking-face (TF). Each scenario contains a substantial number of high-quality forged videos, generated using a total of 22 various recent DeepFake methods. These methods differ in terms of architectures, generation pipelines, and targeted facial regions, covering the most prevalent DeepFake cases witnessed in the wild. We also introduce evaluation protocols for measuring the generalizability of 24 recent detection methods, highlighting the limitations of existing detection methods and the difficulty of our new dataset.
中文: Celeb-DF++数据集的推出通过涵盖三种常见伪造场景下的22种不同方法,为提升DeepFake检测的泛化能力提供了大规模多样化基准,同时揭示了现有检测方法的局限性。
English: The introduction of the Celeb-DF++ dataset addresses the need for a large-scale and diverse benchmark to improve generalizable DeepFake detection by including 22 different forgery methods across three common scenarios, revealing the limitations of current detection techniques.

Authors:Jaeho Shin, Hyeonjae Gil, Junwoo Jang, Maani Ghaffari, Ayoung Kim
Title: Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold
Abstract:
Affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing Branch-and-Bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code is available on https://github.com/joomeok/GrassmannRegistration.
Chinese: 本文首次在仿射格拉斯曼流形上推导出关于刚体变换的显式可优化代价函数,通过直接最小化测地距离来解决配准问题,并在多种计算机视觉任务中展现出优越性能。
English: This paper introduces the first explicit and optimizable cost function on the affine Grassmannian manifold that directly measures geodesic distance with respect to rigid body transformations, enabling globally optimal solutions for registration problems in computer vision.
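
For intuition, the geodesic distance underlying the cost reduces, in the classical linear-subspace case, to the principal-angle computation below; the paper's contribution is making such a distance an explicit, optimizable function of (R, t) for affine subspaces, which this NumPy sketch does not attempt.

import numpy as np

def grassmann_geodesic_distance(A, B):
    # A, B: (n, k) matrices whose columns span two k-dimensional subspaces of R^n.
    # Principal angles come from the singular values of Qa^T Qb; the geodesic
    # distance on the Grassmann manifold is the 2-norm of the angle vector.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return np.linalg.norm(theta)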

Authors:Huy Nguyen, Kien Nguyen, Akila Pemasiri, Akmal Jahan, Clinton Fookes, Sridha Sridharan
Title: AG-VPReID.VIR: Bridging Aerial and Ground Platforms for Video-based Visible-Infrared Person Re-ID
Abstract:
Person re-identification (Re-ID) across visible and infrared modalities is crucial for 24-hour surveillance systems, but existing datasets primarily focus on ground-level perspectives. While ground-based IR systems offer nighttime capabilities, they suffer from occlusions, limited coverage, and vulnerability to obstructions--problems that aerial perspectives uniquely solve. To address these limitations, we introduce AG-VPReID.VIR, the first aerial-ground cross-modality video-based person Re-ID dataset. This dataset captures 1,837 identities across 4,861 tracklets (124,855 frames) using both UAV-mounted and fixed CCTV cameras in RGB and infrared modalities. AG-VPReID.VIR presents unique challenges including cross-viewpoint variations, modality discrepancies, and temporal dynamics. Additionally, we propose TCC-VPReID, a novel three-stream architecture designed to address the joint challenges of cross-platform and cross-modality person Re-ID. Our approach bridges the domain gaps between aerial-ground perspectives and RGB-IR modalities, through style-robust feature learning, memory-based cross-view adaptation, and intermediary-guided temporal modeling. Experiments show that AG-VPReID.VIR presents distinctive challenges compared to existing datasets, with our TCC-VPReID framework achieving significant performance gains across multiple evaluation protocols. Dataset and code are available at https://github.com/agvpreid25/AG-VPReID.VIR.
中文: 本文提出了首个空地跨模态视频行人重识别数据集AG-VPReID.VIR,并开发了新型三流TCC-VPReID框架,通过鲁棒特征学习有效解决了跨平台和跨模态的挑战,实现了卓越性能。
English: The authors introduce AG-VPReID.VIR, the first aerial-ground cross-modality video dataset for person re-identification, along with a novel three-stream TCC-VPReID framework that effectively addresses cross-platform and cross-modality challenges through robust feature learning and achieves superior performance.

Authors:Deepa Krishnaswamy, Cosmin Ciausu, Steve Pieper, Ron Kikinis, Benjamin Billot, Andrey Fedorov
Title: Benchmarking of Deep Learning Methods for Generic MRI Multi-Organ Abdominal Segmentation
Abstract:
Recent advances in deep learning have led to robust automated tools for segmentation of abdominal computed tomography (CT). Meanwhile, segmentation of magnetic resonance imaging (MRI) is substantially more challenging due to the inherent signal variability and the increased effort required for annotating training datasets. Hence, existing approaches are trained on limited sets of MRI sequences, which might limit their generalizability. To characterize the landscape of MRI abdominal segmentation tools, we present here a comprehensive benchmarking of the three state-of-the-art and open-source models: MRSegmentator, MRISegmentator-Abdomen, and TotalSegmentator MRI. Since these models are trained using labor-intensive manual annotation cycles, we also introduce and evaluate ABDSynth, a SynthSeg-based model purely trained on widely available CT segmentations (no real images). More generally, we assess accuracy and generalizability by leveraging three public datasets (not seen by any of the evaluated methods during their training), which span all major manufacturers, five MRI sequences, as well as a variety of subject conditions, voxel resolutions, and fields-of-view. Our results reveal that MRSegmentator achieves the best performance and is most generalizable. In contrast, ABDSynth yields slightly less accurate results, but its relaxed requirements in training data make it an alternative when the annotation budget is limited. The evaluation code and datasets are given for future benchmarking at https://github.com/deepakri201/AbdoBench, along with inference code and weights for ABDSynth.
中文: 深度学习进展已实现稳健的腹部CT分割,但MRI分割因信号变异性和标注数据有限而更具挑战性;基准测试显示MRSegmentator性能最优且泛化能力最强,而ABDSynth虽精度稍逊,却因降低标注需求成为实用替代方案。
English: Recent deep learning advances have enabled robust abdominal CT segmentation, but MRI segmentation remains challenging due to signal variability and limited annotated data, leading to a benchmarking study that found MRSegmentator as the most accurate and generalizable model while ABDSynth offers a practical alternative with reduced annotation needs.

Authors:Rameen Abdal, Or Patashnik, Ekaterina Deyneka, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Title: Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA
Abstract:
Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.
中文: 本文提出了一种零样本动态概念个性化框架,通过Grid-LoRA适配器和网格填充模块,无需测试时优化即可实现可扩展的高质量视频生成。
English: The paper introduces a zero-shot framework for dynamic concept personalization in text-to-video generation, using Grid-LoRA adapters and a Grid Fill module to achieve scalable, high-quality video synthesis without test-time optimization.

Authors:Md. Al-Masrur Khan, Durgakant Pushp, Lantao Liu
Title: AFRDA: Attentive Feature Refinement for Domain Adaptive Semantic Segmentation
Abstract:
In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high-resolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty-driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V --> Cityscapes and 1.04% mIoU on Synthia --> Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA
中文: 自适应特征优化(AFR)模块通过语义先验和高频分量优化高分辨率特征,以不确定性驱动注意力平衡局部与全局信息,从而在无监督领域自适应语义分割中实现最先进的性能。
English: The Adaptive Feature Refinement (AFR) module enhances unsupervised domain adaptive semantic segmentation by refining high-resolution features with semantic priors and high-frequency components, achieving state-of-the-art performance through improved local-global information balance.

Authors:Dou Hoon Kwark, Shirui Luo, Xiyue Zhu, Yudu Li, Zhi-Pei Liang, Volodymyr Kindratenko
Title: Hierarchical Diffusion Framework for Pseudo-Healthy Brain MRI Inpainting with Enhanced 3D Consistency
Abstract:
Pseudo-healthy image inpainting is an essential preprocessing step for analyzing pathological brain MRI scans. Most current inpainting methods favor slice-wise 2D models for their high in-plane fidelity, but their independence across slices produces discontinuities in the volume. Fully 3D models alleviate this issue, but their high model capacity demands extensive training data for reliable, high-fidelity synthesis -- often impractical in medical settings. We address these limitations with a hierarchical diffusion framework by replacing direct 3D modeling with two perpendicular coarse-to-fine 2D stages. An axial diffusion model first yields a coarse, globally consistent inpainting; a coronal diffusion model then refines anatomical details. By combining perpendicular spatial views with adaptive resampling, our method balances data efficiency and volumetric consistency. Our experiments show our approach outperforms state-of-the-art baselines in both realism and volumetric consistency, making it a promising solution for pseudo-healthy image inpainting. Code is available at https://github.com/dou0000/3dMRI-Consistent-Inpaint.
中文: 本文提出一种分层扩散框架,通过轴向和冠状扩散模型的组合,在保持数据效率的同时实现了优异的体积一致性和解剖细节还原,为伪健康脑部MRI修复提供了创新解决方案。
English: This paper introduces a hierarchical diffusion framework for pseudo-healthy brain MRI inpainting that combines axial and coronal diffusion models to achieve superior volumetric consistency and anatomical detail while maintaining data efficiency.

Authors:Semih Eren, Deniz Kucukahmetler, Nico Scherf
Title: Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)
Abstract:
Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
Chinese: 该分层多模态循环集成模型通过融合视频、音频和语言嵌入的时间动态,有效预测了自然刺激下的皮层响应,在Algonauts 2025挑战赛中表现优异,为未来多模态大脑编码基准建立了可扩展的坚实基础。
English: A hierarchical multimodal recurrent ensemble model effectively predicts cortical responses to naturalistic stimuli by integrating temporal video, audio, and language embeddings, achieving strong performance in the Algonauts 2025 challenge and setting a robust baseline for future brain-encoding benchmarks.
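
One plausible form of the composite MSE-correlation objective: Pearson r per parcel across time, mixed with MSE. The weighting lam is an assumption, not the entry's actual value.

import torch

def mse_corr_loss(pred, target, lam=0.5, eps=1e-8):
    # pred, target: (T, parcels) predicted and measured fMRI time series.
    mse = ((pred - target) ** 2).mean()
    p = pred - pred.mean(dim=0, keepdim=True)
    t = target - target.mean(dim=0, keepdim=True)
    r = (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + eps)  # per-parcel Pearson r
    return lam * mse + (1 - lam) * (1 - r).mean()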

Authors:Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yu Qiao, Peng Gao
Title: Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
Abstract:
We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks, including subject-driven generation, image editing, controllable synthesis, and dense prediction, within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.
中文: Lumina-mGPT 2.0 是一个独立的自回归模型,在图像生成质量上达到了与DALL-E 3相媲美的先进水平,同时通过统一框架实现了架构自由和多任务处理能力。
English: Lumina-mGPT 2.0 is a standalone autoregressive model that achieves state-of-the-art image generation quality comparable to DALL-E 3 while offering architectural freedom and multi-task capabilities within a unified framework.

Authors:Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin
Title: Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
Abstract:
Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
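
The essence of Part Attention is restricting attention to semantically consistent part regions. A dense, simplified stand-in (the method itself operates on sparse voxel tokens with part labels from the annotation pipeline):

import torch

def part_restricted_attention(q, k, v, part_labels):
    # q, k, v: (n, d) token projections; part_labels: (n,) integer part ids.
    # Token i may only attend to tokens j with the same part label, avoiding
    # global quadratic attention while preserving within-part structure.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    same_part = part_labels.unsqueeze(-1) == part_labels.unsqueeze(-2)
    scores = scores.masked_fill(~same_part, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v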

Authors:Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang
Title: Yume: An Interactive World Generation Model
Abstract:
Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
中文: Yume是一个通过图像生成动态虚拟世界的交互系统,其框架融合了相机运动量化、视频扩散变换器和抗伪影机制,支持用户通过键盘操作探索虚拟环境。
English: Yume is an interactive system that generates dynamic virtual worlds from images, enabling exploration via keyboard controls through a framework integrating camera motion quantization, video diffusion transformers, and artifact reduction techniques.

Authors:Jialiang Wang, Xianming Liu, Xiong Zhou, Gangfeng Hu, Deming Zhai, Junjun Jiang, Xiangyang Ji
Title: Joint Asymmetric Loss for Learning with Noisy Labels
Abstract:
Learning with noisy labels is a crucial task for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions, particularly symmetric losses. Nevertheless, symmetric losses usually suffer from the underfitting issue due to the overly strict constraint. To address this problem, the Active Passive Loss (APL) jointly optimizes an active and a passive loss to mutually enhance the overall fitting ability. Within APL, symmetric losses have been successfully extended, yielding advanced robust loss functions. Despite these advancements, emerging theoretical analyses indicate that asymmetric losses, a new class of robust loss functions, possess superior properties compared to symmetric losses. However, existing asymmetric losses are not compatible with advanced optimization frameworks such as APL, limiting their potential and applicability. Motivated by this theoretical gap and the prospect of asymmetric losses, we extend the asymmetric loss to the more complex passive loss scenario and propose the Asymmetric Mean Square Error (AMSE), a novel asymmetric loss. We rigorously establish the necessary and sufficient condition under which AMSE satisfies the asymmetric condition. By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL). Extensive experiments demonstrate the effectiveness of our method in mitigating label noise. Code available at: https://github.com/cswjl/joint-asymmetric-loss
中文: 本文提出了一种新颖的非对称均方误差(AMSE)作为稳健损失函数,以解决对称损失在处理噪声标签时的不足,并将其融入主动被动损失框架中形成联合非对称损失(JAL),通过大量实验验证了其有效性。
English: This paper introduces the Asymmetric Mean Square Error (AMSE) as a novel robust loss function to address the limitations of symmetric losses in handling noisy labels, integrating it into the Active Passive Loss framework to form the Joint Asymmetric Loss (JAL), which demonstrates effectiveness through extensive experiments.
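
To illustrate the APL-style combination the paper builds on, the sketch below pairs an active term (Normalized Cross Entropy, a standard choice in the APL literature) with a passive term. The passive term here is plain MSE against a one-hot target as a placeholder; the paper's AMSE imposes a specific asymmetric form that is not reproduced here.

import torch.nn.functional as F

def nce(logits, target):
    # Normalized Cross Entropy: cross entropy of the true class, normalized by
    # the summed cross entropy over all classes (a common "active" robust loss).
    logp = F.log_softmax(logits, dim=1)
    num = -logp.gather(1, target.unsqueeze(1)).squeeze(1)
    den = -logp.sum(dim=1)
    return (num / den).mean()

def passive_mse(logits, target):
    # Placeholder passive term: MSE between softmax probabilities and one-hot
    # labels; the paper's AMSE replaces this with an asymmetric variant.
    p = F.softmax(logits, dim=1)
    y = F.one_hot(target, num_classes=logits.shape[1]).float()
    return ((p - y) ** 2).sum(dim=1).mean()

def joint_loss(logits, target, alpha=1.0, beta=1.0):
    # APL recipe: weighted sum of an active and a passive robust loss.
    return alpha * nce(logits, target) + beta * passive_mse(logits, target)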

Authors:Daiqi Liu, Tomás Arias-Vergara, Jana Hutter, Andreas Maier, Paula Andrea Pérez-Toro
Title: Audio-Vision Contrastive Learning for Phonological Class Recognition
Abstract:
Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at https://github.com/DaE-plz/AC_Contrastive_Phonology to support future research.
中文: 本研究提出了一种结合实时磁共振成像和语音信号的多模态深度学习框架,通过对比学习方法对发音特征进行分类,以0.81的F1分数实现了最先进的性能。
English: This study introduces a multimodal deep learning framework integrating real-time MRI and speech signals to classify articulatory features, achieving state-of-the-art performance with a 0.81 F1-score through contrastive learning.
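
A standard symmetric InfoNCE objective between paired audio and rtMRI-frame embeddings, given as a plausible stand-in for the contrastive audio-vision fusion; the temperature is illustrative.

import torch
import torch.nn.functional as F

def audio_vision_contrastive(a_emb, v_emb, temperature=0.07):
    # a_emb, v_emb: (batch, d) embeddings of paired audio segments and rtMRI frames.
    a = F.normalize(a_emb, dim=-1)
    v = F.normalize(v_emb, dim=-1)
    logits = a @ v.T / temperature
    labels = torch.arange(a.shape[0], device=a.device)  # matching pairs on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2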

Authors:Jiahui Yin, Xinxing Cheng, Jinming Duan, Yan Pang, Declan O'Regan, Hadrien Reynaud, Qingjie Meng
Title: MCM: Mamba-based Cardiac Motion Tracking using Sequential Images in MRI
Abstract:
Myocardial motion tracking is important for assessing cardiac function and diagnosing cardiovascular diseases, for which cine cardiac magnetic resonance (CMR) has been established as the gold standard imaging modality. Many existing methods learn motion from single image pairs consisting of a reference frame and a randomly selected target frame from the cardiac cycle. However, these methods overlook the continuous nature of cardiac motion and often yield inconsistent and non-smooth motion estimations. In this work, we propose a novel Mamba-based cardiac motion tracking network (MCM) that explicitly incorporates target image sequence from the cardiac cycle to achieve smooth and temporally consistent motion tracking. By developing a bi-directional Mamba block equipped with a bi-directional scanning mechanism, our method facilitates the estimation of plausible deformation fields. With our proposed motion decoder that integrates motion information from frames adjacent to the target frame, our method further enhances temporal coherence. Moreover, by taking advantage of Mamba's structured state-space formulation, the proposed method learns the continuous dynamics of the myocardium from sequential images without increasing computational complexity. We evaluate the proposed method on two public datasets. The experimental results demonstrate that the proposed method quantitatively and qualitatively outperforms both conventional and state-of-the-art learning-based cardiac motion tracking methods. The code is available at https://github.com/yjh-0104/MCM.
中文: 本研究提出的基于Mamba的心脏运动追踪网络(MCM)通过整合连续心脏图像序列和双向扫描机制,实现了平滑且时序一致的运动追踪,在精度和连贯性上均优于现有方法。
English: The proposed Mamba-based cardiac motion tracking network (MCM) utilizes sequential cardiac images and a bi-directional scanning mechanism to achieve smooth, temporally consistent motion estimation, outperforming existing methods in accuracy and coherence.

Authors:Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao
Title: Monocular Semantic Scene Completion via Masked Recurrent Networks
Abstract:
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available.
中文: 提出的两阶段MonoMRN框架通过掩膜循环网络和距离注意力投影技术,在单目语义场景补全任务中实现了最优性能,能有效处理室内外场景并显著提升模型抗干扰能力。
English: The proposed two-stage MonoMRN framework, featuring a Masked Recurrent Network with distance attention projection, achieves state-of-the-art monocular semantic scene completion by efficiently handling both indoor and outdoor scenes while improving robustness against disturbances.
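
One simplified reading of the masked sparse GRU: run the recurrent update only at voxels the mask flags as occupied, leaving other hidden states untouched, which is where the computational savings come from. The paper's mask updating mechanism is omitted here.

import torch
import torch.nn as nn

class MaskedSparseGRUCell(nn.Module):
    # Simplified stand-in for MS-GRU: the GRU update runs only on masked voxels.
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, x, h, mask):
        # x, h: (num_voxels, dim) features and hidden states; mask: (num_voxels,)
        # boolean of currently occupied voxels.
        h_new = h.clone()
        if mask.any():
            h_new[mask] = self.cell(x[mask], h[mask])
        return h_new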

Authors:Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Amit H. Bermano
Title: Attention (as Discrete-Time Markov) Chains
Abstract:
We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank -- the steady state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
中文摘要:本文将注意力矩阵重新解释为马尔可夫链,揭示了语义相似标记形成亚稳态的机制,并引入衡量全局标记重要性的TokenRank方法,实现了先进的零样本分割并提升了图像生成效果。
English Summary: This paper reinterprets the attention matrix as a Markov chain, revealing how semantically similar tokens form metastable states and introducing TokenRank for measuring global token importance, achieving state-of-the-art zero-shot segmentation and improved image generation.
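
The steady-state computation is indeed lightweight: treat the row-stochastic attention matrix as a Markov transition matrix and power-iterate. A minimal sketch with arbitrary iteration count and tolerance:

import torch

def token_rank(attn, iters=100, tol=1e-8):
    # attn: (n, n) row-stochastic attention matrix, read as the transition
    # matrix of a discrete-time Markov chain over tokens.
    n = attn.shape[0]
    pi = torch.full((n,), 1.0 / n)
    for _ in range(iters):
        nxt = pi @ attn               # one step of the chain: pi_{t+1} = pi_t P
        if (nxt - pi).abs().max() < tol:
            break
        pi = nxt
    return pi                         # steady-state vector = global token importance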

Authors:Olaf Dünkel, Artur Jesslen, Jiahao Xie, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
Title: CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
Abstract:
An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.
中文: CNS-Bench通过扩散模型与LoRA适配器构建连续干扰变化基准,评估图像分类器在真实多变干扰下的分布外鲁棒性,发现模型排名随干扰类型和程度改变,突破了传统二元基准的局限性。
English: CNS-Bench introduces a continuous nuisance shift benchmark using diffusion models with LoRA adapters to evaluate image classifiers' out-of-distribution robustness across realistic, varying severities, revealing that model rankings shift with different nuisance types and scales, unlike binary benchmarks.
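
The continuous-severity idea can be pictured as scaling a LoRA update before merging it into a frozen diffusion weight; a one-line sketch under that assumption (not the benchmark's code):

def shifted_weight(W, A, B, severity):
    # Continuous nuisance severity as a scaled LoRA update on a frozen weight:
    # W' = W + severity * (B @ A). severity in [0, 1] sweeps from the clean
    # model (0) to the fully applied shift adapter (1). Works with NumPy arrays
    # or torch tensors; purely illustrative.
    return W + severity * (B @ A)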

Authors:Yang Li, Zongzheng Zhang, Xuchong Qiu, Xinrun Li, Ziming Liu, Leichen Wang, Ruikai Li, Zhenxin Zhu, Huan-ang Gao, Xiaojian Lin, Zhiyong Cui, Hang Zhao, Hao Zhao
Title: Reusing Attention for One-stage Lane Topology Understanding
Abstract:
Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overhead. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines and topology relationship, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without using SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.
中文: 本文提出了一种单阶段架构,通过共享Transformer解码器的注意力资源同时预测交通要素、车道中心线和拓扑关系,不仅提高了准确性和推理速度,还无需昂贵图网络即可建模拓扑关系,并首次实现了从标准地图模型到无地图模型的知识蒸馏。
English: This paper introduces a one-stage architecture that simultaneously predicts traffic elements, lane centerlines, and topology relationships using shared transformer decoder attention, improving accuracy and speed while eliminating the need for costly graph networks and enabling knowledge transfer from SD map models.

Authors:Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt
Title: PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
Abstract:
While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at https://maxiuw.github.io/prix.

Authors:Yuqing Lan, Chenyang Zhu, Shuaifeng Zhi, Jiazhao Zhang, Zhoufeng Wang, Renjiao Yi, Yijie Wang, Kai Xu
Title: RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction
Abstract:
The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such a mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly smoothed results of purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art methods, whether based on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
Chinese: RemixFusion提出了一种基于残差的混合表示法,结合显式TSDF网格与隐式神经模块,实现了高质量、大规模的在线RGB-D重建和相机姿态估计,在精度和效率上均超越了现有最优方法。
English: RemixFusion introduces a residual-based mixed representation combining explicit TSDF grids with implicit neural modules to achieve high-quality, large-scale online RGB-D reconstruction and camera pose estimation, surpassing state-of-the-art methods in accuracy and efficiency.
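
The residual-based mixed representation queries additively: a trilinear lookup into the explicit coarse TSDF grid plus an MLP residual for fine detail. A minimal PyTorch sketch with illustrative sizes; positional encoding, tracking, and the BA machinery are omitted:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMixedSDF(nn.Module):
    def __init__(self, res=128, hidden=64):
        super().__init__()
        self.coarse = nn.Parameter(torch.zeros(1, 1, res, res, res))  # explicit TSDF grid
        self.residual = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def forward(self, pts):
        # pts: (N, 3) query points in [-1, 1]^3. Trilinear lookup of the coarse
        # value, plus a learned fine-detail residual.
        grid = pts.view(1, -1, 1, 1, 3)
        base = F.grid_sample(self.coarse, grid, align_corners=True).view(-1, 1)
        return base + self.residual(pts)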

Authors:Xinyao Liu, Diping Song
Title: Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
Abstract:
Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o's 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.068}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in specific MLLMs. Our project can be found at https://github.com/MeteorElf/FundusExpert.
中文: FundusExpert是一种专为眼科设计的先进多模态大语言模型,通过整合定位与诊断推理能力,有效解决了标注粒度碎片化和临床逻辑不一致的问题,在医学任务中表现出卓越性能。
English: FundusExpert is an advanced multimodal large language model designed for ophthalmology, integrating positioning and diagnostic reasoning to achieve superior performance in medical tasks by addressing annotation fragmentation and clinical reasoning inconsistencies.
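
As a quick reading of the reported scaling law $L \propto N^{0.068}$: a tenfold increase in data volume multiplies the capability measure by $10^{0.068} \approx 1.17$, and a doubling by only $2^{0.068} \approx 1.05$, which underlines the authors' point that cognitively aligned annotation quality, not raw scale alone, drives data efficiency.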

Authors:Jiajun Luo, Yicheng Xiao, Jianru Xu, Yangxiu You, Rongwei Lu, Chen Tang, Jingyan Jiang, Zhi Wang
Title: Accelerating Parallel Diffusion Model Serving with Residual Compression
Abstract:
Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy-adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4xL20, it achieves 3.0x speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over prior overlap-based method. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. Portable implementation demonstrated on xDiT is publicly available at https://github.com/Cobalt-27/CompactFusion
中文: CompactFusion是一种压缩框架,通过在设备间仅传输压缩残差来减少并行扩散模型推理中的通信开销,在保持生成质量的同时实现显著加速。
English: CompactFusion is a compression framework that reduces communication overhead in parallel diffusion model inference by transmitting only compressed residuals between devices, achieving significant speedups while maintaining generation quality.
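
A sketch of step-wise residual compression with error feedback, using top-k sparsification as a stand-in for whatever codec the system actually employs; class and attribute names are invented for illustration:

import torch

class ResidualCompressor:
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.prev = None   # mirrors the receiver's current reconstruction
        self.err = None    # accumulated compression error (feedback)

    def compress(self, act):
        if self.prev is None:
            self.prev, self.err = act.clone(), torch.zeros_like(act)
            return act                        # first step: send the activation in full
        # Adjacent diffusion steps are highly similar, so the residual is small.
        residual = act - self.prev + self.err
        flat = residual.flatten()
        k = max(1, int(flat.numel() * self.ratio))
        idx = flat.abs().topk(k).indices
        sent = torch.zeros_like(flat)
        sent[idx] = flat[idx]                 # keep only the largest entries
        sent = sent.view_as(residual)
        self.err = residual - sent            # error feedback for the next step
        self.prev = self.prev + sent          # track the receiver-side state
        return sent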

Authors:Jorgen Cani, Christos Diou, Spyridon Evangelatos, Vasileios Argyriou, Panagiotis Radoglou-Grammatikis, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Title: Illicit object detection in X-ray imaging using deep learning techniques: A comparative evaluation
Abstract:
Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at https://github.com/jgenc/xray-comparative-evaluation.
English Summary: Automated X-ray inspection faces challenges like object occlusion and limited training data, prompting a systematic evaluation of deep learning methods using six datasets and ten detection schemes to provide critical insights and publicly available resources.

Authors:Minglong Xue, Aoxiang Ning, Shivakumara Palaiahnakote, Mingliang Zhou
Title: DFDNet: Dynamic Frequency-Guided De-Flare Network
Abstract:
Strong light sources in nighttime photography frequently produce flares in images, significantly degrading visual quality and impacting the performance of downstream tasks. While some progress has been made, existing methods continue to struggle with removing large-scale flare artifacts and repairing structural damage in regions near the light source. We observe that these challenging flare artifacts exhibit more significant discrepancies from the reference images in the frequency domain compared to the spatial domain. Therefore, this paper presents a novel dynamic frequency-guided deflare network (DFDNet) that decouples content information from flare artifacts in the frequency domain, effectively removing large-scale flare artifacts. Specifically, DFDNet consists mainly of a global dynamic frequency-domain guidance (GDFG) module and a local detail guidance module (LDGM). The GDFG module guides the network to perceive the frequency characteristics of flare artifacts by dynamically optimizing global frequency domain features, effectively separating flare information from content information. Additionally, we design an LDGM via a contrastive learning strategy that aligns the local features of the light source with the reference image, reduces local detail damage from flare removal, and improves fine-grained image restoration. The experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods in terms of performance. The code is available at https://github.com/AXNing/DFDNet.
中文摘要:本文提出DFDNet,一种动态频率引导的去眩光网络,通过在频域分离内容与眩光并结合对比学习优化局部细节,有效消除夜间图像中的大规模眩光伪影。
English Summary: The paper introduces DFDNet, a dynamic frequency-guided deflare network that effectively removes large-scale flare artifacts in nighttime images by decoupling content from flares in the frequency domain and enhancing local details through contrastive learning.
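
The frequency-domain decoupling can be pictured with a fixed low-pass split; the GDFG module learns and dynamically adapts this guidance, so the circular mask below is purely illustrative:

import torch

def frequency_split(img, radius=0.1):
    # Split an image into low- and high-frequency bands via a centered FFT mask.
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h, device=img.device),
                            torch.linspace(-1, 1, w, device=img.device),
                            indexing="ij")
    mask = ((xx ** 2 + yy ** 2).sqrt() < radius).to(f.dtype)  # circular low-pass
    low = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
    return low, img - low   # low- and high-frequency components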

Authors:Francesco Tonini, Lorenzo Vaquero, Alessandro Conti, Cigdem Beyan, Elisa Ricci
Title: Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection
Abstract:
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at https://github.com/francescotonini/dysco.
中文: 摘要提出DYSCO,一种无需训练的人物交互检测框架,通过多模态注册表和自适应特征加权机制利用视觉语言模型增强交互表征,在罕见交互场景中表现尤为突出。
English: The abstract introduces DYSCO, a training-free Human-Object Interaction detection framework that leverages Vision-Language Models to enhance interaction representation through a multimodal registry and adaptive feature weighting, achieving superior performance especially in rare interactions.
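
Training-free scoring in this spirit reduces to similarity against per-verb prototypes held in a registry. The sketch below replaces the adaptive multi-head weighting with a fixed scalar and assumes pre-extracted, aligned embeddings; it is not the DYSCO pipeline itself.

import torch
import torch.nn.functional as F

def registry_scores(interaction_feat, text_protos, visual_protos, w_text=0.5):
    # interaction_feat: (d,) embedding of one detected human-object pair;
    # text_protos, visual_protos: (num_verbs, d) per-verb registry prototypes.
    f = F.normalize(interaction_feat, dim=-1)
    t = F.normalize(text_protos, dim=-1)
    v = F.normalize(visual_protos, dim=-1)
    return w_text * (t @ f) + (1 - w_text) * (v @ f)   # (num_verbs,) verb scores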

Authors:Sneha George Gnanakalavathy, Hairil Abdul Razak, Robert Meertens, Jonathan E. Fieldsend, Xujiong Ye, Mohammed M. Abdelsamea
Title: CAPRI-CT: Causal Analysis and Predictive Reasoning for Image Quality Optimization in Computed Tomography
Abstract:
In computed tomography (CT), achieving high image quality while minimizing radiation exposure remains a key clinical challenge. This paper presents CAPRI-CT, a novel causal-aware deep learning framework for Causal Analysis and Predictive Reasoning for Image Quality Optimization in CT imaging. CAPRI-CT integrates image data with acquisition metadata (such as tube voltage, tube current, and contrast agent types) to model the underlying causal relationships that influence image quality. An ensemble of Variational Autoencoders (VAEs) is employed to extract meaningful features and generate causal representations from observational data, including CT images and associated imaging parameters. These input features are fused to predict the Signal-to-Noise Ratio (SNR) and support counterfactual inference, enabling what-if simulations, such as changes in contrast agents (types and concentrations) or scan parameters. CAPRI-CT is trained and validated using an ensemble learning approach, achieving strong predictive performance. By facilitating both prediction and interpretability, CAPRI-CT provides actionable insights that could help radiologists and technicians design more efficient CT protocols without repeated physical scans. The source code and dataset are publicly available at https://github.com/SnehaGeorge22/capri-ct.
中文: 本文提出CAPRI-CT框架,通过整合CT图像与采集参数建立因果模型,利用变分自编码器预测信噪比并支持反事实推理,为优化CT扫描方案提供可操作的见解。
English: This paper introduces CAPRI-CT, a causal-aware deep learning framework that models causal relationships between CT imaging parameters and image quality to predict SNR and enable counterfactual simulations, aiding in protocol optimization without physical scans.
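
The counterfactual interface can be as simple as re-running the SNR predictor with edited acquisition metadata. Everything below (the model call signature, the metadata field names) is an assumed interface for illustration:

import torch

@torch.no_grad()
def what_if_snr(model, image, metadata, overrides):
    # Counterfactual "what-if": predict SNR again after editing acquisition
    # metadata, e.g. overrides = {"tube_voltage_kvp": 100, "contrast_agent": "iodine"}.
    counterfactual = dict(metadata)
    counterfactual.update(overrides)
    return model(image, counterfactual)   # predicted SNR under the edited protocol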

Authors:Jun Li, Jinpeng Wang, Chaolei Tan, Niu Lian, Long Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia, Bin Chen
Title: HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
Abstract:
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce "text < video" hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
中文摘要:本文提出HLFormer,一种用于部分相关视频检索的双曲建模框架,通过融合混合空间编码和偏序保持机制,有效克服欧几里得空间在层次结构建模上的不足,显著提升了跨模态匹配性能。
English Summary: The paper introduces HLFormer, a hyperbolic modeling framework for Partially Relevant Video Retrieval that overcomes Euclidean space limitations by integrating hybrid space encoding and partial order preservation to enhance hierarchical representation and cross-modal matching.
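
To make the "text < video" partial order concrete, the minimal Lorentz-model sketch below penalizes a text embedding that lies farther from the hyperboloid origin than its paired video. This is a deliberately simplified surrogate for the paper's Lorentzian cone constraint, assuming embeddings are lifted onto the hyperboloid with curvature -1.

```python
# Simplified margin proxy for a "text < video" partial-order loss; not
# HLFormer's exact Lorentzian cone formulation.
import torch

def lift_to_hyperboloid(u):
    # Lift a Euclidean vector u onto the Lorentz hyperboloid <x, x>_L = -1
    # by setting the time coordinate x0 = sqrt(1 + ||u||^2).
    x0 = torch.sqrt(1.0 + (u * u).sum(-1, keepdim=True))
    return torch.cat([x0, u], dim=-1)

def dist_to_origin(x):
    # Geodesic distance to the origin o = (1, 0, ..., 0) on the hyperboloid:
    # d(x, o) = arccosh(-<x, o>_L) = arccosh(x0).
    return torch.acosh(torch.clamp(x[..., 0], min=1.0 + 1e-7))

def partial_order_loss(text_feat, video_feat, margin=0.1):
    # Encourage the (more general) text to sit nearer the origin than the
    # (more specific) video, a coarse surrogate for cone containment.
    t = lift_to_hyperboloid(text_feat)
    v = lift_to_hyperboloid(video_feat)
    return torch.relu(dist_to_origin(t) - dist_to_origin(v) + margin).mean()

loss = partial_order_loss(torch.randn(8, 16), torch.randn(8, 16))
```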

Authors:Hyeongjin Nam, Donghwan Kim, Gyeongsik Moon, Kyoung Mu Lee
Title: PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image
Abstract:
The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction. The project page is available at https://hygenie1228.github.io/PARTE/.

Authors:Chao He, Jianqiang Ren, Jianjing Xiang, Xiejie Shen
Title: CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits
Abstract:
With the rapid advancement of large foundation models, AIGC, cloud rendering, and real-time motion capture technologies, digital humans are now capable of achieving synchronized facial expressions and body movements, engaging in intelligent dialogues driven by natural language, and enabling the fast creation of personalized avatars. While current mainstream approaches to digital humans primarily focus on 3D models and 2D video-based representations, interactive 2D cartoon-style digital humans have received relatively less attention. Compared to 3D digital humans that require complex modeling and high rendering costs, and 2D video-based solutions that lack flexibility and real-time interactivity, 2D cartoon-style Live2D models offer a more efficient and expressive alternative. By simulating 3D-like motion through layered segmentation without the need for traditional 3D modeling, Live2D enables dynamic and real-time manipulation. In this technical report, we present CartoonAlive, an innovative method for generating high-quality Live2D digital humans from a single input portrait image. CartoonAlive leverages the shape basis concept commonly used in 3D face modeling to construct facial blendshapes suitable for Live2D. It then infers the corresponding blendshape weights based on facial keypoints detected from the input image. This approach allows for the rapid generation of a highly expressive and visually accurate Live2D model that closely resembles the input portrait, within less than half a minute. Our work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation. The project homepage is https://human3daigc.github.io/CartoonAlive_webpage/.
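
The blendshape-weight inference step can be pictured as a constrained least-squares fit: detected keypoints are explained as the neutral face plus a weighted sum of blendshape offsets. The matrix sizes, the non-negative solver, and the random stand-in data below are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative fit of Live2D-style blendshape weights to facial keypoints.
import numpy as np
from scipy.optimize import nnls

K, N = 68, 20                     # keypoints, blendshapes (illustrative sizes)
base = np.random.rand(K * 2)      # neutral-face keypoints, flattened (x, y)
basis = np.random.rand(K * 2, N)  # per-blendshape keypoint offsets
detected = base + basis @ np.random.rand(N)  # stand-in for detector output

# Solve detected ~ base + basis @ w with non-negative weights, since
# blendshape activations are typically constrained to [0, 1].
w, _ = nnls(basis, detected - base)
w = np.clip(w, 0.0, 1.0)
reconstructed = base + basis @ w
print("keypoint RMSE:", np.sqrt(np.mean((reconstructed - detected) ** 2)))
```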

Authors:Peiqi Chen, Lei Yu, Yi Wan, Yingying Pei, Xinyi Liu, Yongxiang Yao, Yingying Zhang, Lixiang Ru, Liheng Zhong, Jingdong Chen, Ming Yang, Yongjun Zhang
Title: CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
Abstract:
Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.
中文摘要:提出的CasP流程采用级联对应先验和两阶段匹配机制,通过选择性交叉注意力增强特征区分度,在提升半稠密特征匹配精度的同时大幅加速计算,尤其在跨域泛化方面表现突出,适用于SLAM和无人机等实时高鲁棒性场景。
English Summary: The proposed CasP pipeline introduces cascaded correspondence priors and a two-phase matching process with selective cross-attention to enhance accuracy and efficiency in semi-dense feature matching, achieving significant speed improvements and superior cross-domain generalization for applications like SLAM and UAV systems.
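
A toy rendering of the two-phase idea: a coarse similarity first yields a one-to-many top-k prior for each feature, and one-to-one matches are then chosen only inside those prior areas rather than by a global search. The real pipeline places region-based selective cross-attention between the phases, which this sketch omits.

```python
# Two-phase matching with a restricted second-phase search (toy example).
import torch
import torch.nn.functional as F

fa = F.normalize(torch.randn(100, 256), dim=1)  # features from image A
fb = F.normalize(torch.randn(120, 256), dim=1)  # features from image B

# Phase 1: coarse similarity gives each A-feature a top-k candidate set in B.
sim = fa @ fb.t()                      # (100, 120) cosine similarity
prior = sim.topk(k=8, dim=1).indices   # one-to-many prior areas

# Phase 2: one-to-one matches chosen only within the prior candidates,
# instead of an argmax over the entire feature map.
restricted = torch.full_like(sim, float("-inf"))
restricted.scatter_(1, prior, sim.gather(1, prior))
matches = restricted.argmax(dim=1)     # final one-to-one assignment
```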

Authors:Ruodai Cui, Li Niu, Guosheng Hu
Title: Unsupervised Exposure Correction
Abstract:
Current exposure correction methods face three challenges: labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods [12], while utilizing only 0.01% of their parameters. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset are publicly available at https://github.com/BeyondHeaven/uec_code.
Chinese: 本文提出了一种无监督曝光校正(UEC)方法,通过使用模拟配对数据和新型变换函数,无需人工标注,提高了泛化能力,并增强了低级视觉任务的性能。
English: This paper introduces an Unsupervised Exposure Correction (UEC) method that eliminates manual annotations, improves generalizability, and enhances performance in low-level vision tasks by using simulated paired data and a novel transformation function.

Authors:Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu
Title: DesignLab: Designing Slides Through Iterative Detection and Correction
Abstract:
Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles: the design reviewer, who identifies design-related issues, and the design contributor, who corrects them. This decomposition enables an iterative loop where the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be polished further with each iteration and to reach qualities that would otherwise be unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer to learn design errors and the contributor to learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of designing, which can result in polished, professional slides.
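
The reviewer/contributor decomposition reduces to a simple detect-and-correct loop. In this schematic, review_slide and correct_slide are hypothetical stand-ins for the two fine-tuned LLM roles, and the stopping rule is an assumption.

```python
# Schematic of an iterative detection-and-correction loop for slide design.
def refine_slides(draft, review_slide, correct_slide, max_iters=5):
    """Iteratively detect design issues and correct them until none remain."""
    for _ in range(max_iters):
        issues = review_slide(draft)          # reviewer: list of design issues
        if not issues:                        # converged: nothing left to fix
            break
        draft = correct_slide(draft, issues)  # contributor: apply fixes
    return draft
```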

Authors:Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Title: Vec2Face+ for Face Dataset Generation
Abstract:
When synthesizing identities as face recognition training data, it is generally believed that large inter-class separability and intra-class attribute variation are essential for synthesizing a quality dataset. This belief is generally correct, and this is what we aim for. However, when increasing intra-class variation, existing methods overlook the necessity of maintaining intra-class identity consistency. To address this and generate high-quality face training data, we propose Vec2Face+, a generative model that creates images directly from image features and allows for continuous and easy control of face identities and attributes. Using Vec2Face+, we obtain datasets with proper inter-class separability, intra-class variation, and identity consistency using three strategies: 1) we sample vectors sufficiently different from others to generate well-separated identities; 2) we propose an AttrOP algorithm for increasing general attribute variations; 3) we propose LoRA-based pose control for generating images with profile head poses, which is more efficient and identity-preserving than AttrOP. Our system generates VFace10K, a synthetic face dataset with 10K identities, which allows an FR model to achieve state-of-the-art accuracy on seven real-world test sets. Scaling the size to 4M and 12M images, the corresponding VFace100K and VFace300K datasets yield higher accuracy than the real-world training dataset, CASIA-WebFace, on five real-world test sets. This is the first time a synthetic dataset beats CASIA-WebFace in average accuracy. In addition, we find that only 1 out of 11 synthetic datasets outperforms random guessing (i.e., 50%) in twin verification and that models trained with synthetic identities are more biased than those trained with real identities. Both are important aspects for future investigation. Code is available at https://github.com/HaiyuWu/Vec2Face_plus.
中文: Vec2Face+ 是一种生成模型,通过直接处理图像特征并控制人脸身份与属性,在保持类内身份一致性的同时增强类间差异与类内变化,生成的合成人脸数据集在多个真实测试集上达到最先进精度,首次超越真实数据集CASIA-WebFace。
English: Vec2Face+ is a generative model that synthesizes high-quality face recognition training data by ensuring strong inter-class separability and intra-class variation while maintaining identity consistency, achieving state-of-the-art accuracy on real-world test sets and surpassing real datasets like CASIA-WebFace.
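
Strategy (1), sampling identity vectors sufficiently different from one another, can be sketched as rejection sampling under a cosine-similarity threshold. The dimension and threshold below are illustrative; the paper's actual sampling procedure may differ.

```python
# Rejection sampling of well-separated identity vectors (illustrative).
import numpy as np

def sample_identities(n_ids, dim=512, max_cos=0.3, rng=np.random.default_rng(0)):
    accepted = []
    while len(accepted) < n_ids:
        v = rng.standard_normal(dim)
        v /= np.linalg.norm(v)
        # Keep the vector only if its cosine similarity to every accepted
        # identity stays below the separability threshold.
        if all(abs(v @ u) < max_cos for u in accepted):
            accepted.append(v)
    return np.stack(accepted)

ids = sample_identities(100)  # 100 well-separated identity vectors
```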

Authors:Bharath Krishnamurthy, Ajita Rattani
Title: DOOMGAN: High-Fidelity Dynamic Identity Obfuscation Ocular Generative Morphing
Abstract:
Ocular biometrics in the visible spectrum have emerged as a prominent modality due to their high accuracy, resistance to spoofing, and non-invasive nature. However, morphing attacks, synthetic biometric traits created by blending features from multiple individuals, threaten biometric system integrity. While extensively studied for near-infrared iris and face biometrics, morphing in visible-spectrum ocular data remains underexplored. Simulating such attacks demands advanced generation models that handle uncontrolled conditions while preserving detailed ocular features like iris boundaries and periocular textures. To address this gap, we introduce DOOMGAN, which encompasses landmark-driven encoding of visible ocular anatomy, attention-guided generation for realistic morph synthesis, and dynamic weighting of multi-faceted losses for optimized convergence. DOOMGAN achieves over 20% higher attack success rates than baseline methods under stringent thresholds, along with 20% better elliptical iris structure generation and 30% improved gaze consistency. We also release the first comprehensive ocular morphing dataset to support further research in this domain.
Chinese: DOOMGAN作为一种先进模型,通过结合地标驱动编码、注意力引导生成和动态损失加权,有效模拟可见光谱眼部生物特征的形态攻击,相比基线方法在攻击成功率和特征保持方面表现更优。
English: DOOMGAN is introduced as an advanced model that effectively simulates morphing attacks on visible-spectrum ocular biometrics by integrating landmark-driven encoding, attention-guided generation, and dynamic loss weighting, achieving superior attack success rates and feature preservation compared to baseline methods.

Authors:Ruodai Cui, Lei Zhang
Title: UNICE: Training A Universal Image Contrast Enhancer
Abstract:
Existing image contrast enhancement methods are typically designed for specific tasks such as under-/over-exposure correction, low-light and backlit image enhancement, etc. The learned models, however, exhibit poor generalization performance across different tasks, even across different datasets of a specific task. It is important to explore whether we can learn a universal and generalized model for various contrast enhancement tasks. In this work, we observe that the common key factor of these tasks lies in the need of exposure and contrast adjustment, which can be well-addressed if high-dynamic range (HDR) inputs are available. We hence collect 46,928 HDR raw images from public sources, and render 328,496 sRGB images to build multi-exposure sequences (MES) and the corresponding pseudo sRGB ground-truths via multi-exposure fusion. Consequently, we train a network to generate an MES from a single sRGB image, followed by training another network to fuse the generated MES into an enhanced image. Our proposed method, namely UNiversal Image Contrast Enhancer (UNICE), is free of costly human labeling, yet it demonstrates significantly stronger generalization performance than existing image contrast enhancement methods across and within different tasks, even outperforming manually created ground-truths in multiple no-reference image quality metrics. The dataset, code and model are available at https://github.com/BeyondHeaven/UNICE.
中文: 现有图像对比度增强方法通常针对特定任务设计且泛化能力差,而UNICE通过利用高动态范围输入和多曝光融合技术,无需人工标注即可显著提升不同任务间的通用性和增强效果。
English: Current image contrast enhancement methods are task-specific and lack generalization, but UNICE introduces a universal model that uses high-dynamic range inputs and multi-exposure fusion to significantly improve performance across various tasks without human labeling.
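
The second stage, fusing a generated multi-exposure sequence (MES) into one enhanced image, can be illustrated with a classical well-exposedness weighting. UNICE learns this fusion with a network; the hand-crafted weights below are only a stand-in for the idea.

```python
# Classical multi-exposure fusion via well-exposedness weights (stand-in).
import numpy as np

def fuse_mes(mes, sigma=0.2):
    """mes: (N, H, W, 3) float images in [0, 1] at different exposures."""
    # Weight each pixel by how close its luminance is to mid-gray.
    lum = mes.mean(axis=-1, keepdims=True)            # (N, H, W, 1)
    w = np.exp(-((lum - 0.5) ** 2) / (2 * sigma ** 2))
    w /= w.sum(axis=0, keepdims=True) + 1e-8          # normalize over exposures
    return (w * mes).sum(axis=0)

mes = np.random.rand(3, 64, 64, 3)  # stand-in for a generated MES
fused = fuse_mes(mes)
```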

Authors:Ting Jiang, Yixiao Wang, Hancheng Ye, Zishan Shao, Jingwei Sun, Jingyang Zhang, Zekai Chen, Jianyi Zhang, Yiran Chen, Hai Li
Title: SADA: Stability-guided Adaptive Diffusion Acceleration
Abstract:
Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic attention costs. Existing training-free acceleration strategies reduce per-step computation cost and effectively shorten sampling time, but demonstrate low faithfulness compared to the original baseline. We hypothesize that this fidelity gap arises because (a) different prompts correspond to varying denoising trajectories, and (b) such methods do not consider the underlying ODE formulation and its numerical solution. In this paper, we propose Stability-guided Adaptive Diffusion Acceleration (SADA), a novel paradigm that unifies step-wise and token-wise sparsity decisions via a single stability criterion to accelerate sampling of ODE-based generative models (Diffusion and Flow-matching). For (a), SADA adaptively allocates sparsity based on the sampling trajectory. For (b), SADA introduces principled approximation schemes that leverage the precise gradient information from the numerical ODE solver. Comprehensive evaluations on SD-2, SDXL, and Flux using both EDM and DPM++ solvers reveal consistent $\ge 1.8\times$ speedups with minimal fidelity degradation (LPIPS $\leq 0.10$ and FID $\leq 4.5$) compared to unmodified baselines, significantly outperforming prior methods. Moreover, SADA adapts seamlessly to other pipelines and modalities: It accelerates ControlNet without any modifications and speeds up MusicLDM by $1.8\times$ with $\sim 0.01$ spectrogram LPIPS.
中文: 扩散模型存在计算成本高且现有加速方法保真度低的问题,而提出的SADA框架通过自适应稀疏分配和利用ODE求解器梯度,在多种模型和模态上实现了显著加速且质量损失极小。
English: Diffusion models face high computational costs and fidelity issues with current acceleration methods, but the proposed SADA framework adaptively optimizes sparsity and leverages ODE solver gradients to achieve significant speedups with minimal quality loss across various models and modalities.
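
A toy version of stability-guided skipping: when consecutive predictions barely change, the cached output is reused instead of running the network again. SADA's actual criterion unifies step-wise and token-wise sparsity and exploits the ODE solver's gradient information; the threshold, the bookkeeping, and the Euler-style update below are all simplified assumptions.

```python
# Toy stability-guided caching for an iterative denoiser (not SADA itself).
import torch

def denoise_with_skipping(model, x, timesteps, tol=0.05):
    cached_eps, prev_eps = None, None
    for t in timesteps:
        if cached_eps is not None and prev_eps is not None:
            # Relative change between the two most recent predictions
            # serves as the stability score.
            change = (cached_eps - prev_eps).norm() / (prev_eps.norm() + 1e-8)
            eps = cached_eps if change < tol else model(x, t)
        else:
            eps = model(x, t)                 # warm-up: always compute
        prev_eps, cached_eps = cached_eps, eps
        x = x - 0.01 * eps                    # placeholder Euler-style update
    return x

model = lambda x, t: 0.1 * x                  # stand-in denoiser
out = denoise_with_skipping(model, torch.randn(4, 8), torch.linspace(1, 0, 50))
```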

Authors:Zaipeng Duan, Chenxu Dang, Xuzhong Hu, Pei An, Junfeng Ding, Jie Zhan, Yunbiao Xu, Jie Ma
Title: SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction
Abstract:
Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.
Chinese: 提出的SDG-OCC网络通过联合语义和深度引导的视角变换与主动蒸馏技术,克服了现有多模态3D占据预测方法的局限性,在基准数据集上实现了最先进的性能并支持实时处理。
English: The proposed SDG-OCC network introduces a joint semantic and depth-guided view transformation with active distillation to overcome limitations in existing multimodal 3D occupancy prediction methods, achieving state-of-the-art performance on benchmark datasets while enabling real-time processing.

Authors:Luchuan Song, Yang Zhou, Zhan Xu, Yi Zhou, Deepali Aneja, Chenliang Xu
Title: StreamME: Simplify 3D Gaussian Avatar within Live Stream
Abstract:
We propose StreamME, a method focused on fast 3D avatar reconstruction. StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expressions. To further ensure high efficiency in on-the-fly training, we introduce a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, optimizing the number of points while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects facial privacy and reduces communication bandwidth in VR systems and online conferencing. Additionally, it can be directly applied to downstream applications such as animation, toonify, and relighting. Please refer to our project page for more details: https://songluchuan.github.io/StreamME/.
中文: StreamME是一种快速3D虚拟形象重建方法,通过基于3D高斯泼溅的实时训练技术,能在直播视频流中同步构建头像模型,既保护面部隐私又优化传输带宽。
English: StreamME is a fast 3D avatar reconstruction method that uses on-the-fly training with 3D Gaussian Splatting to enable real-time avatar creation from live video streams while ensuring facial privacy and efficient bandwidth usage.

Authors:Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Xuanhua He, Chenyang Zhu, Hongyu Liu, Yingqing He, Zeyu Wang, Zhifeng Li, Xiu Li, Wei Liu, Dan Xu, Linfeng Zhang, Qifeng Chen
Title: Controllable Video Generation: A Survey
Abstract:
With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.
中文摘要:本综述系统梳理了可控视频生成领域,针对纯文本提示的局限性,探索了相机运动、深度图等多模态控制机制,以提升人工智能生成视频与用户意图的匹配度。
English Summary: This survey systematically reviews the field of controllable video generation, addressing the limitations of text-only prompts by exploring multi-modal control mechanisms like camera motion and depth maps to enhance user intent alignment in AI-generated videos.

Authors:Joey Spronck, Leander van Eekelen, Dominique van Midden, Joep Bogaerts, Leslie Tessier, Valerie Dechering, Muradije Demirel-Andishmand, Gabriel Silva de Souza, Roland Nemeth, Enrico Munari, Giuseppe Bogina, Ilaria Girolami, Albino Eccher, Balazs Acs, Ceren Boyaci, Natalie Klubickova, Monika Looijen-Salamon, Shoko Vos, Francesco Ciompi
Title: A tissue and cell-level annotated H&E and PD-L1 histopathology image dataset in non-small cell lung cancer
Abstract:
The tumor immune microenvironment (TIME) in non-small cell lung cancer (NSCLC) histopathology contains morphological and molecular characteristics predictive of immunotherapy response. Computational quantification of TIME characteristics, such as cell detection and tissue segmentation, can support biomarker development. However, currently available digital pathology datasets of NSCLC for the development of cell detection or tissue segmentation algorithms are limited in scope, lack annotations of clinically prevalent metastatic sites, and forgo molecular information such as PD-L1 immunohistochemistry (IHC). To fill this gap, we introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated NSCLC whole-slide images. We publicly release 887 fully annotated regions of interest from 155 unique patients across three complementary tasks: (i) multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC, (ii) nuclei detection, and (iii) PD-L1 positive tumor cell detection in PD-L1 IHC slides. To the best of our knowledge, this is the first public NSCLC dataset with manual annotations of H&E in metastatic sites and PD-L1 IHC.
中文摘要:IGNITE数据工具包作为首个公开的NSCLC数据集,提供了包含转移部位H&E染色和PD-L1免疫组化的全面标注,填补了当前数字病理学资源的空白。
English Summary: The IGNITE data toolkit is introduced as the first public NSCLC dataset with comprehensive annotations for tissue segmentation, nuclei detection, and PD-L1 IHC analysis, addressing current limitations in digital pathology resources.

Authors:Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang
Title: ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Abstract:
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans, guided by reinforcement with action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
Chinese: ThinkAct提出了一种双系统框架,通过强化视觉潜在规划将高层推理与低层动作执行相结合,在复杂具身AI任务中实现了少样本适应、长程规划和自我纠正行为。
English: ThinkAct introduces a dual-system framework that integrates high-level reasoning with low-level action execution through reinforced visual latent planning, enabling few-shot adaptation, long-horizon planning, and self-correction in complex embodied AI tasks.

Authors:Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen
Title: Task-Specific Zero-shot Quantization-Aware Training for Object Detection
Abstract:
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our code is publicly available at https://github.com/DFQ-Dojo/dfq-toolkit.
中文: 本文提出了一种新颖的面向目标检测网络的任务特定零样本量化框架,通过合成任务特定校准数据并结合任务感知训练来恢复性能,在基准数据集上取得了领先水平的结果。
English: This paper introduces a novel task-specific zero-shot quantization framework for object detection networks that synthesizes task-specific calibration data and integrates task-aware training to restore performance, achieving state-of-the-art results on benchmark datasets.
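
The first stage, synthesizing a task-specific calibration layout, amounts to sampling object categories, locations, and sizes. The uniform and log-normal priors below are placeholders; the paper reconstructs these distributions from the pre-trained detector itself, without prior knowledge of the data.

```python
# Placeholder sampling of categories and boxes for a calibration layout.
import numpy as np

rng = np.random.default_rng(0)

def sample_layout(n_objects, n_classes=80, img_size=640):
    cats = rng.integers(0, n_classes, size=n_objects)
    # Sample box centers uniformly and sizes log-normally, then clip.
    cx, cy = rng.uniform(0, img_size, (2, n_objects))
    w = np.clip(rng.lognormal(np.log(80), 0.5, n_objects), 8, img_size)
    h = np.clip(rng.lognormal(np.log(80), 0.5, n_objects), 8, img_size)
    x1 = np.clip(cx - w / 2, 0, img_size)
    y1 = np.clip(cy - h / 2, 0, img_size)
    x2 = np.clip(cx + w / 2, 0, img_size)
    y2 = np.clip(cy + h / 2, 0, img_size)
    return cats, np.stack([x1, y1, x2, y2], axis=1)

# These sampled labels would then drive image synthesis and task-specific
# distillation against the full-precision detector.
cats, boxes = sample_layout(n_objects=5)
```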

Authors:Marcel Kleinmann, Shashank Agnihotri, Margret Keuper
Title: Faithful, Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks
Abstract:
Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. In this work, we address these limitations by introducing anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality. Our experiments on chest X-ray datasets demonstrate that the modified $\text{B-cos}_\text{FLC}$ and $\text{B-cos}_\text{BP}$ preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-class and multi-label settings. Code is available at https://github.com/mkleinma/B-cos-medical-paper.
Chinese: 本研究通过引入FLCPooling和BlurPool等抗锯齿策略改进B-cos网络,在保持诊断性能的同时消除解释图中的伪影,使其适用于医学影像的临床应用。
English: The study enhances B-cos networks by integrating anti-aliasing techniques like FLCPooling and BlurPool, which eliminate artifacts in explanation maps while preserving diagnostic accuracy, making them viable for clinical use in medical imaging.
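
BlurPool (Zhang, 2019), one of the two anti-aliasing strategies named above, replaces strided downsampling with a fixed low-pass blur followed by subsampling. The standalone module below is a standard sketch of that operation, not the authors' exact integration into B-cos layers.

```python
# Anti-aliased downsampling: blur with a fixed binomial kernel, then stride.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)
        k = k / k.sum()  # normalized 3x3 binomial (low-pass) kernel
        # One copy of the kernel per channel, applied depthwise.
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).clone())

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=x.shape[1])

pool = BlurPool2d(channels=16)
y = pool(torch.randn(1, 16, 32, 32))  # -> (1, 16, 16, 16)
```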

Authors:Keneni W. Tesema, Lyndon Hill, Mark W. Jones, Gary K. L. Tam
Title: Denoising-While-Completing Network (DWCNet): Robust Point Cloud Completion Under Corruption
Abstract:
Point cloud completion is crucial for 3D computer vision tasks in autonomous driving, augmented reality, and robotics. However, obtaining clean and complete point clouds from real-world environments is challenging due to noise and occlusions. Consequently, most existing completion networks -- trained on synthetic data -- struggle with real-world degradations. In this work, we tackle the problem of completing and denoising highly corrupted partial point clouds affected by multiple simultaneous degradations. To benchmark robustness, we introduce the Corrupted Point Cloud Completion Dataset (CPCCD), which highlights the limitations of current methods under diverse corruptions. Building on these insights, we propose DWCNet (Denoising-While-Completing Network), a completion framework enhanced with a Noise Management Module (NMM) that leverages contrastive learning and self-attention to suppress noise and model structural relationships. DWCNet achieves state-of-the-art performance on both clean and corrupted, synthetic and real-world datasets. The dataset and code will be publicly available at https://github.com/keneniwt/DWCNET-Robust-Point-Cloud-Completion-against-Corruptions
中文: 本文提出DWCNet点云补全框架,通过噪声管理模块有效处理多种现实世界退化问题,在合成和真实数据集上均实现了最先进的性能。
English: This paper introduces DWCNet, a robust point cloud completion framework with a Noise Management Module that effectively handles multiple real-world degradations, achieving state-of-the-art performance on both synthetic and real-world datasets.

Authors:Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang
Title: Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation
Abstract:
The application of Vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of datasets is suboptimal, requiring larger volumes of training data, while only yielding modest performance improvements. In this paper, we propose a two-stage method named MpGI (Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. Firstly, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM (Multimodal Large Language Model) Relay Generation and MLLMs generation methods. Next, we utilize Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions, capturing details from multiple perspectives. Finally, we have created the HQRS-IT-210K dataset, including about 210,000 RS images and 1.3 million captions. We fine-tuned two VLFMs using our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model. This process resulted in our proposed HQRS-CLIP and RS-CoCa models. Experimental results demonstrate that HQRS-CLIP surpassed the previous SOTA RS CLIP model in various downstream tasks while using only 4.2% of the training data. RS-CoCa outperforms other advanced approaches across benchmark datasets and can generate captions for RS images that rival or even exceed manual annotations. The dataset, pre-trained models, and code will be released at https://github.com/YiguoHe/HQRS-210K-and-HQRS-CLIP.
Chinese: 本文提出了一种名为MpGI的两阶段方法,用于生成高质量的遥感图像文本描述,并创建了HQRS-IT-210K数据集,该数据集以极少的训练数据显著提升了视觉语言基础模型的性能。
English: This paper introduces MpGI, a two-stage method for generating high-quality captions for remote sensing images, and the resulting HQRS-IT-210K dataset, which significantly enhances VLFM performance with minimal training data.

Authors:Meng Lou, Yunxiang Fu, Yizhou Yu
Title: A2Mamba: Attention-augmented State Space Models for Visual Recognition
Abstract:
Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies have integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer-Mamba hybrid network architecture, featuring a new token mixer termed Multi-scale Attention-augmented State Space Model (MASS), where multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross-attention by spatially aggregating the SSM's hidden states using the multi-scale attention maps, which enhances spatial dependencies pertaining to a two-dimensional space while improving the dynamic modeling capabilities of SSMs. Our A2Mamba outperforms all previous ConvNet-, Transformer-, and Mamba-based architectures in visual recognition tasks. For instance, A2Mamba-L achieves an impressive 86.1% top-1 accuracy on ImageNet-1K. In semantic segmentation, A2Mamba-B exceeds CAFormer-S36 by 2.5% in mIoU, while exhibiting higher efficiency. In object detection and instance segmentation with Cascade Mask R-CNN, A2Mamba-S surpasses MambaVision-B by 1.2%/0.9% in AP^b/AP^m, while having 40% fewer parameters. Code is publicly available at https://github.com/LMMMEng/A2Mamba.
中文: 提出的A2Mamba架构通过新型多尺度注意力增强状态空间模型(MASS)实现了Transformer与Mamba层的深度融合,在多项视觉识别任务中以更高效率达到最优性能。
English: The proposed A2Mamba architecture integrates Transformer and Mamba layers through a novel Multi-scale Attention-augmented State Space Model (MASS), achieving state-of-the-art performance across multiple visual recognition tasks with superior efficiency.

Authors:Xueming Fu, Pei Wu, Yingtai Li, Xin Luo, Zihang Jiang, Junhao Mei, Jian Lu, Gao-Jun Teng, S. Kevin Zhou
Title: Dyna3DGR: 4D Cardiac Motion Tracking with Dynamic 3D Gaussian Representation
Abstract:
Accurate analysis of cardiac motion is crucial for evaluating cardiac function. While dynamic cardiac magnetic resonance imaging (CMR) can capture detailed tissue motion throughout the cardiac cycle, the fine-grained 4D cardiac motion tracking remains challenging due to the homogeneous nature of myocardial tissue and the lack of distinctive features. Existing approaches can be broadly categorized into image-based and representation-based, each with its limitations. Image-based methods, including both traditional and deep learning-based registration approaches, either struggle with topological consistency or rely heavily on extensive training data. Representation-based methods, while promising, often suffer from loss of image-level details. To address these limitations, we propose Dynamic 3D Gaussian Representation (Dyna3DGR), a novel framework that combines explicit 3D Gaussian representation with implicit neural motion field modeling. Our method simultaneously optimizes cardiac structure and motion in a self-supervised manner, eliminating the need for extensive training data or point-to-point correspondences. Through differentiable volumetric rendering, Dyna3DGR efficiently bridges continuous motion representation with image-space alignment while preserving both topological and temporal consistency. Comprehensive evaluations on the ACDC dataset demonstrate that our approach surpasses state-of-the-art deep learning-based diffeomorphic registration methods in tracking accuracy. The code will be available at https://github.com/windrise/Dyna3DGR.
Chinese: 提出的动态3D高斯表示(Dyna3DGR)框架结合显式3D高斯表示与隐式神经运动场,以自监督方式实现精准心脏运动追踪,在ACDC数据集上超越现有方法且无需大量训练数据。
English: The proposed Dynamic 3D Gaussian Representation (Dyna3DGR) framework combines explicit 3D Gaussian representation with implicit neural motion fields to achieve accurate cardiac motion tracking in a self-supervised manner, outperforming existing methods on the ACDC dataset without requiring extensive training data.

Authors:Xiaojiao Xiao, Qinmin Vivian Hu, Guanghui Wang
Title: Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis
Abstract:
Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes random multi-scale high-proportion masks to speed up diffusion model training and to balance detail fidelity and overall structure. The Transformer-based diffusion process incorporates cross-granularity regularization, modeling mutual information consistency across each granularity's latent space, thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at https://github.com/xiaojiao929/PHMDiff.
中文摘要:本文提出PHMDiff模型,通过多尺度掩码扩散和跨粒度正则化方法,在加速训练的同时提升了医学图像合成的质量与结构完整性,在多个数据集上表现出优越性能。
English Summary: The paper introduces PHMDiff, a pyramid hierarchical masked diffusion model that accelerates training and enhances medical image synthesis quality by employing multi-scale masked diffusion and cross-granularity regularization, achieving superior performance on benchmark datasets.

Authors:Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Title: EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion
Abstract:
Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation imposed by vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D. Our project page is available at https://whiteinblue.github.io/earthcrafter/.
中文摘要:本研究提出了迄今最大的三维航拍数据集Aerial-Earth3D,并开发了EarthCrafter框架,通过稀疏解耦的潜在扩散技术实现大规模三维地球生成,在提升计算效率的同时保持了地理合理性。
English Summary: The study introduces Aerial-Earth3D, the largest 3D aerial dataset, and EarthCrafter, a novel framework using sparse-decoupled latent diffusion to enable large-scale 3D Earth generation with enhanced computational efficiency and geographic accuracy.

Authors:Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang
Title: Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
Abstract:
A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, revealing that the improvements stem from our progressive spatial awareness scheme, which mines more profound spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.
中文: 提出的Spatial 3D-LLM通过渐进式空间嵌入增强3D视觉语言任务的空间感知能力,并借助新任务和专用数据集验证了其领先性能。
English: The proposed Spatial 3D-LLM enhances spatial awareness in 3D vision-language tasks through progressive spatial embeddings and achieves state-of-the-art performance, as validated by novel tasks and a dedicated dataset.

Authors:Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie
Title: VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
Abstract:
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
Chinese: VGGT-Long 是一种创新系统,通过分块处理和轻量级优化克服内存限制,实现了户外环境千米级单目三维重建,无需相机标定或模型重训练即可达到与传统方法相当的性能。
English: VGGT-Long is a novel system that overcomes memory limitations to enable kilometer-scale monocular 3D reconstruction in outdoor environments through chunk-based processing and lightweight optimization, achieving performance comparable to traditional methods without requiring camera calibration or retraining.
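
The overlapping-alignment step can be illustrated with the classic Umeyama solver: points reconstructed by two adjacent chunks in their overlap determine a Sim(3) transform that stitches the chunks together. Loop closure and the chunking itself are omitted, and the synthetic data is for demonstration only.

```python
# Sim(3) alignment of two overlapping chunks via Umeyama (1991).
import numpy as np

def umeyama(src, dst):
    """Find scale s, rotation R, translation t minimizing ||s*R@src + t - dst||."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# The same points reconstructed by two overlapping chunks (synthetic case):
pts_a = np.random.rand(500, 3)                    # chunk A's frame
pts_b = 1.2 * pts_a + np.array([0.5, 0.0, -0.3])  # chunk B's frame
s, R, t = umeyama(pts_a, pts_b)
aligned = s * pts_a @ R.T + t                     # chunk A mapped into B's frame
```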

Authors:Kahim Wong, Jicheng Zhou, Haiwei Wu, Yain-Whar Si, Jiantao Zhou
Title: ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement
Abstract:
The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for robust document image forgery detection. Though forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document background (BG) and structured text. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a robust document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces' sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. Also, a hierarchical content disentanglement approach is proposed to boost the localization performance via mitigating the text-BG disparities. Furthermore, noticing the predominantly pristine nature of BG regions, we construct a pristine prototype capturing traces of untampered regions, and eventually enhance both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79% averaged over 5 types of distortions. The code is available at https://github.com/KAHIMWONG/ACDC-Net.
中文: 本文提出ADCD-Net模型,通过自适应融合RGB/DCT取证特征并整合文档图像特性,在多种失真条件下实现卓越的伪造区域定位性能,显著优于现有方法。
English: This paper introduces ADCD-Net, a robust document forgery detection model that adaptively combines RGB/DCT forensic traces with document characteristics to significantly outperform existing methods in localizing tampered regions under various distortions.

Authors:Kuo-Cheng Wu, Guohang Zhuang, Jinyang Huang, Xiang Zhang, Wanli Ouyang, Yan Lu
Title: STAR: A Benchmark for Astronomical Star Fields Super-Resolution
Abstract:
Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) metric to evaluate SR models from a physical view. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that can accurately infer flux-consistent high-resolution images from input photometry, surpassing several state-of-the-art SR methods by 24.84% on the newly designed flux consistency metric, which shows the suitability of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at https://github.com/GuoCheng12/STAR.
中文: 提出的STAR数据集通过提供通量一致的图像对并引入新的通量误差度量,解决了天文超分辨率中的关键局限,同时配套的FISR模型在保持物理通量特性方面展现出优越性能。
English: The proposed STAR dataset addresses critical limitations in astronomical super-resolution by providing flux-consistent image pairs and introducing a novel Flux Error metric, while the accompanying FISR model demonstrates superior performance in preserving physical flux properties.
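
A plausible sketch of a flux-consistency check in the spirit of the Flux Error (FE) metric: compare integrated flux between the super-resolved and ground-truth fields over a grid of sub-regions. The exact normalization and region definition used by the paper may differ.

```python
# Illustrative flux-consistency measure between SR output and ground truth.
import numpy as np

def flux_error(sr, hr, grid=8):
    """Mean relative flux deviation over a grid of sub-regions."""
    h, w = hr.shape
    errs = []
    for i in range(grid):
        for j in range(grid):
            ph = hr[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            ps = sr[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            errs.append(abs(ps.sum() - ph.sum()) / (abs(ph.sum()) + 1e-8))
    return float(np.mean(errs))

hr = np.random.rand(256, 256)                # ground-truth star field
sr = hr + 0.01 * np.random.randn(256, 256)   # stand-in SR output
print("FE:", flux_error(sr, hr))
```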

Authors:Jinquan Guan, Junhong Guo, Qi Chen, Jian Chen, Yongkang Cai, Yilin He, Zhiquan Huang, Yan Wang, Yutong Xie
Title: A High Magnifications Histopathology Image Dataset for Oral Squamous Cell Carcinoma Diagnosis and Prognosis
Abstract:
Oral Squamous Cell Carcinoma (OSCC) is a prevalent and aggressive malignancy where deep learning-based computer-aided diagnosis and prognosis can enhance clinical assessments. However, existing publicly available OSCC datasets often suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable models. To bridge this gap, we introduce Multi-OSCC, a new histopathology image dataset comprising 1,325 OSCC patients, integrating both diagnostic and prognostic information to expand existing public resources. Each patient is represented by six high-resolution histopathology images captured at x200, x400, and x1000 magnifications (two per magnification), covering both the core and edge tumor regions. The Multi-OSCC dataset is richly annotated for six critical clinical tasks: recurrence prediction (REC), lymph node metastasis (LNM), tumor differentiation (TD), tumor invasion (TI), cancer embolus (CE), and perineural invasion (PI). To benchmark this dataset, we systematically evaluate the impact of different visual encoders, multi-image fusion techniques, stain normalization, and multi-task learning frameworks. Our analysis yields several key insights: (1) The top-performing models achieve excellent results, with an Area Under the Curve (AUC) of 94.72% for REC and 81.23% for TD, while all tasks surpass 70% AUC; (2) Stain normalization benefits diagnostic tasks but negatively affects recurrence prediction; (3) Multi-task learning incurs a 3.34% average AUC degradation compared to single-task models in our multi-task benchmark, underscoring the challenge of balancing multiple tasks in our dataset. To accelerate future research, we publicly release the Multi-OSCC dataset and baseline models at https://github.com/guanjinquan/OSCC-PathologyImageDataset.
中文: Multi-OSCC数据集通过整合1325名口腔癌患者的诊断与预后信息,填补了现有公共数据的不足,在复发预测中取得94.72%的AUC优异性能,同时揭示了染色标准化对预后预测的负面影响及多任务学习的平衡难题。
English: The Multi-OSCC dataset addresses limitations in existing oral cancer data by providing comprehensive histopathology images from 1,325 patients with diagnostic and prognostic annotations, achieving up to 94.72% AUC in recurrence prediction while revealing key insights about stain normalization and multi-task learning challenges.

Authors:Joseph De Mathia, Carlos Francisco Moreno-García
Title: Scene Text Detection and Recognition "in light of" Challenging Environmental Conditions using Aria Glasses Egocentric Vision Cameras
Abstract:
In an era where wearable technology is reshaping applications, Scene Text Detection and Recognition (STDR) becomes a straightforward choice through the lens of egocentric vision. Leveraging Meta's Project Aria smart glasses, this paper investigates how environmental variables, such as lighting, distance, and resolution, affect the performance of state-of-the-art STDR algorithms in real-world scenarios. We introduce a novel, custom-built dataset captured under controlled conditions and evaluate two OCR pipelines: EAST with CRNN, and EAST with PyTesseract. Our findings reveal that resolution and distance significantly influence recognition accuracy, while lighting plays a less predictable role. Notably, image upscaling emerged as a key pre-processing technique, reducing Character Error Rate (CER) from 0.65 to 0.48. We further demonstrate the potential of integrating eye-gaze tracking to optimise processing efficiency by focusing on user attention zones. This work not only benchmarks STDR performance under realistic conditions but also lays the groundwork for adaptive, user-aware AR systems. Our contributions aim to inspire future research in robust, context-sensitive text recognition for assistive and research-oriented applications, such as asset inspection and nutrition analysis. The code is available at https://github.com/josepDe/Project_Aria_STR.
中文摘要:本研究利用Meta的Project Aria智能眼镜评估光照、距离和分辨率对场景文本检测与识别的影响,发现分辨率和距离显著影响识别准确率而光照作用较难预测,并证明图像超分辨率和眼动追踪可通过聚焦用户注视区域提升自适应增强现实系统的性能。
English Summary: This study uses Meta's Project Aria smart glasses to evaluate how lighting, distance, and resolution affect Scene Text Detection and Recognition, finding that resolution and distance significantly impact accuracy while lighting is less predictable, and demonstrates that image upscaling and eye-gaze tracking can enhance performance for adaptive AR systems.
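
The Character Error Rate behind the reported 0.65 to 0.48 improvement is the standard one: Levenshtein edit distance between the OCR output and the ground truth, normalized by the reference length. A minimal implementation:

```python
# Character Error Rate via dynamic-programming Levenshtein distance.
def cer(pred: str, ref: str) -> float:
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # edit distances for the empty prediction prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(cer("he1lo world", "hello world"))  # one substitution -> ~0.09
```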

Authors:Kailai Zhou, Fuqiang Yang, Shixian Wang, Bihan Wen, Chongde Zi, Linsen Chen, Qiu Shen, Xun Cao
Title: M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
Abstract:
RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.
中文: 研究者提出M-SpecGene通用多光谱基础模型,通过跨模态结构稀疏性掩码策略学习模态不变表征,在四大类下游任务的十一个数据集上验证了其泛化能力。
English: Researchers introduce M-SpecGene, a generalized RGB-thermal foundation model that learns modality-invariant representations through a novel cross-modality structural sparsity masking strategy, demonstrating strong performance across diverse downstream tasks.

Authors:Yanchen Liu, Yanan Sun, Zhening Xing, Junyao Gao, Kai Chen, Wenjie Pei
Title: MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
Abstract:
Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. To be specific, MotionShot first performs semantic feature matching to ensure high-level alignments between the reference and target objects. It then further establishes low-level morphological alignments through reference-to-target shape retargeting. By encoding motion with temporal attention, our MotionShot can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities, demonstrated by extensive experiments. The project page is available at: https://motionshot.github.io/.

Authors:Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, Chengfei Lyu
Title: Dens3R: A Foundation Model for 3D Geometry Prediction
Abstract:
Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
中文: Dens3R作为一种统一的三维基础模型,通过共享编码器-解码器框架联合回归深度与表面法向量等关联几何量,实现了从单视图到多视图输入的一致性几何感知。
English: Dens3R is a unified 3D foundation model that jointly regresses correlated geometric quantities like depth and surface normals through a shared encoder-decoder framework, achieving consistent geometry perception from single to multi-view inputs.

Authors:Yu Wang, Bo Dang, Wanchun Li, Wei Chen, Yansheng Li
Title: HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
Abstract:
With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at https://github.com/vvangfaye/HoliTracer.
中文:HoliTracer是一种创新框架,通过上下文注意力网络增强分割,并采用稳健流程实现精确矢量化,能够从大幅遥感影像中整体提取地理对象矢量,性能优于现有方法。
English: HoliTracer is a novel framework that holistically extracts vectorized geographic objects from large-size remote sensing imagery using Context Attention Net for enhanced segmentation and a robust pipeline for accurate vectorization, outperforming existing methods.

Authors:Chao Zhou, Tianyi Wei, Nenghai Yu
Title: Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
Abstract:
Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods. The code is available at https://github.com/zhouchao-ops/SaaS.
Chinese: 提出的自适应注意力缩放(SaaS)方法通过动态调整每个子指令的注意力激活,有效解决了统一图像生成模型中的文本指令忽略问题,无需额外训练即可显著提升指令遵循的准确性。
English: The proposed Self-Adaptive Attention Scaling (SaaS) method effectively resolves text instruction neglect in unified image generation models by dynamically scaling attention activation for each sub-instruction, significantly improving instruction-following fidelity without requiring additional training.
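As a rough illustration of the attention-scaling idea (not the authors' implementation), the sketch below rescales the cross-attention weights of a neglected sub-instruction's token span so that its total attention mass matches that of the previous denoising timestep; the span indices, the scaling rule, and the clamping range are all assumptions:

```python
import torch

def rescale_subinstruction_attention(attn_t: torch.Tensor,
                                     attn_prev: torch.Tensor,
                                     span: slice,
                                     max_scale: float = 4.0) -> torch.Tensor:
    """attn_t, attn_prev: [heads, image_tokens, text_tokens] cross-attention maps at the
    current and previous timestep. `span` selects the text tokens of one sub-instruction.
    Returns attn_t with that span rescaled and rows renormalized."""
    mass_t = attn_t[..., span].sum()
    mass_prev = attn_prev[..., span].sum()
    scale = (mass_prev / (mass_t + 1e-8)).clamp(1.0, max_scale)  # only boost, never suppress
    out = attn_t.clone()
    out[..., span] = out[..., span] * scale
    # Renormalize over text tokens so each image query still sums to 1.
    return out / out.sum(dim=-1, keepdim=True)

# toy usage
attn_t = torch.softmax(torch.randn(8, 64, 20), dim=-1)
attn_prev = torch.softmax(torch.randn(8, 64, 20), dim=-1)
calibrated = rescale_subinstruction_attention(attn_t, attn_prev, span=slice(5, 12))
```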

Authors:Shreelekha Revankar, Utkarsh Mall, Cheng Perng Phoo, Kavita Bala, Bharath Hariharan
Title: MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing
Abstract:
Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters remotely. More recently, advances in computer vision and deep learning have helped automate satellite imagery analysis. However, these methods remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of more than 10,000 FEMA disaster events with temporal satellite imagery and natural language annotations from news articles, accompanied by geotagged locations and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems. Code can be found at: https://github.com/ShreelekhaR/MONITRS
中文:MONITRS数据集通过整合多模态卫星图像与自然语言标注,突破了现有灾害监测方法的局限,显著提升了机器学习模型在灾害响应任务中的性能表现。
English: The MONITRS dataset addresses limitations in current disaster monitoring by providing multimodal satellite imagery and natural language annotations, enabling significant performance improvements in machine learning models for disaster response.

Authors:Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang
Title: Advancing Visual Large Language Model for Multi-granular Versatile Perception
Abstract:
Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at https://github.com/xiangwentao666/MVP-LM.
中文: 该摘要提出了MVP-LM框架,通过多粒度解码器和数据集统一策略,将基于词语和句子的感知任务与边界框及掩码预测整合于单一架构,有效提升了计算机视觉任务的通用性和性能表现。
English: The abstract introduces MVP-LM, a unified framework that integrates word- and sentence-based perception tasks with box and mask predictions using a multi-granularity decoder and dataset unification strategy to enhance versatility across computer vision applications.

Authors:Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, Shiqi Gao, Xiaoyu Wang, Jia Wang, Xiongkuo Min, Guangtao Zhai, Weisi Lin
Title: LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
Abstract:
The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics have limitations on scale or alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs' understanding ability and human preferences. Then, we propose LMM4Edit, a LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on the other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.
中文: 本文提出了包含18K编辑图像和人工标注的EBench-18K基准数据集,并开发了LMM4Edit评估指标,该指标能全面评估编辑效果且与人类偏好高度一致,展现出优秀的泛化能力。
English: This paper introduces EBench-18K, a comprehensive benchmark with 18K edited images and human annotations, and proposes LMM4Edit, an all-in-one evaluation metric that aligns well with human preferences and demonstrates strong generalization.

Authors:Fansheng Zeng, Bineng Zhong, Haiying Xia, Yufei Tan, Xiantao Hu, Liangtao Shi, Shuxiang Song
Title: Explicit Context Reasoning with Supervision for Visual Tracking
Abstract:
Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target's evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.
中文摘要:RSTrack通过上下文推理机制、前向监督策略和高效状态建模三个核心机制,显式监督上下文关联过程,有效提升视觉跟踪的时序一致性,并在多个基准数据集上取得领先性能。
English Summary: RSTrack enhances visual tracking by explicitly modeling and supervising context reasoning through three mechanisms—context reasoning, forward supervision, and efficient state modeling—to improve temporal consistency and achieve state-of-the-art performance.

Authors:Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh
Title: MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation
Abstract:
Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: https://github.com/1027865/MLRUPP
中文: 本文提出的MLRU++轻量级混合架构通过创新的注意力模块和多尺度特征聚合,在保持高分割精度的同时显著降低了计算成本,为三维医学图像分割提供了高效解决方案。
English: This paper introduces MLRU++, a lightweight hybrid architecture for medical image segmentation that achieves state-of-the-art accuracy with improved computational efficiency through novel attention mechanisms and multi-scale feature aggregation.

Authors:Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
Title: PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
Abstract:
The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with ≤1/200 of the training cost ($500 vs. ≥$100,000) and ≤1/2500 of the dataset size (4K vs. ≥10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
中文摘要:Pusa框架通过向量化时间步适配突破视频扩散模型中的时序建模限制,在保持基础模型能力的同时实现了卓越的效率和多功能生成能力。
English Summary: The Pusa framework introduces vectorized timestep adaptation to overcome temporal modeling limitations in video diffusion, achieving superior efficiency and multi-task capabilities while preserving base model performance.
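The core change VTA makes, replacing one scalar timestep per video with one timestep per frame, can be illustrated with a few lines of diffusion-style noising. Everything below (the noise schedule, tensor shapes, and sampling of timesteps) is a generic stand-in, not the Wan2.1 code:

```python
import torch

def noise_with_vectorized_timesteps(latents: torch.Tensor,
                                    alphas_cumprod: torch.Tensor):
    """latents: [B, F, C, H, W] video latents; alphas_cumprod: [T] noise schedule.
    Samples an independent timestep for every frame instead of one per video."""
    B, F = latents.shape[:2]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B, F), device=latents.device)      # vectorized timesteps
    a = alphas_cumprod[t].view(B, F, 1, 1, 1)                    # per-frame alpha_bar
    noise = torch.randn_like(latents)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise          # per-frame corruption
    return noisy, noise, t

# toy usage: frames within one sample can now sit at different noise levels
latents = torch.randn(2, 8, 4, 32, 32)
schedule = torch.linspace(0.9999, 0.01, 1000)
noisy, noise, t = noise_with_vectorized_timesteps(latents, schedule)
```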

Authors:Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Ashley Xu, Gia Ancone, Wanhee Lee, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel Yamins
Title: Discovering and using Spelke segments
Abstract:
Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects--groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied to regions of high motion-affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.

Authors:Shahar Zuler, Gal Lifshitz, Hadar Averbuch-Elor, Dan Raviv
Title: Systole-Conditioned Generative Cardiac Motion
Abstract:
Accurate motion estimation in cardiac computed tomography (CT) imaging is critical for assessing cardiac function and surgical planning. Data-driven methods have become the standard approach for dense motion estimation, but they rely on vast amounts of labeled data with dense ground-truth (GT) motion annotations, which are often unfeasible to obtain. To address this limitation, we present a novel approach that synthesizes realistically looking pairs of cardiac CT frames enriched with dense 3D flow field annotations. Our method leverages a conditional Variational Autoencoder (CVAE), which incorporates a novel multi-scale feature conditioning mechanism and is trained to generate 3D flow fields conditioned on a single CT frame. By applying the generated flow field to warp the given frame, we create pairs of frames that simulate realistic myocardium deformations across the cardiac cycle. These pairs serve as fully annotated data samples, providing optical flow GT annotations. Our data generation pipeline could enable the training and validation of more complex and accurate myocardium motion models, allowing for substantially reducing reliance on manual annotations. Our code, along with animated generated samples and additional material, is available on our project page: https://shaharzuler.github.io/GenerativeCardiacMotion_Page.
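The data-generation step, warping a CT frame with a synthesized dense flow field to obtain a fully annotated training pair, is standard resampling. Below is a minimal 3D warping sketch with PyTorch's grid_sample, assuming a voxel-displacement flow at the same resolution as the volume (the CVAE that produces the flow is not shown):

```python
import torch
import torch.nn.functional as F

def warp_volume(volume: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """volume: [B, 1, D, H, W]; flow: [B, 3, D, H, W] displacements in voxels (dz, dy, dx).
    Returns the warped volume; (volume, warped, flow) then forms a training pair with
    dense ground-truth motion."""
    B, _, D, H, W = volume.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H), torch.arange(W),
                                indexing="ij")
    base = torch.stack((zz, yy, xx), dim=0).float().to(volume)        # [3, D, H, W]
    coords = base.unsqueeze(0) + flow                                  # displaced voxel coords
    # Normalize to [-1, 1] and reorder to (x, y, z) as grid_sample expects.
    norm = torch.stack((2 * coords[:, 2] / (W - 1) - 1,
                        2 * coords[:, 1] / (H - 1) - 1,
                        2 * coords[:, 0] / (D - 1) - 1), dim=-1)       # [B, D, H, W, 3]
    return F.grid_sample(volume, norm, align_corners=True)

# toy usage
vol = torch.rand(1, 1, 16, 32, 32)
flow = torch.zeros(1, 3, 16, 32, 32)
warped = warp_volume(vol, flow)   # zero flow: warped matches vol up to interpolation
```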

Authors:Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang
Title: Latent Denoising Makes Good Visual Tokenizers
Abstract:
Despite their fundamental role, it remains unclear what properties could make visual tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs such as Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be more easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.
Chinese: 本研究提出了潜在去噪分词器(l-DeTok),通过将分词器嵌入与去噪目标对齐,提升从受损输入中重建的能力,在ImageNet 256x256数据集上的多种生成模型中均优于标准分词器。
English: The study introduces the Latent Denoising Tokenizer (l-DeTok), which aligns tokenizer embeddings with the denoising objective to enhance reconstruction from corrupted inputs, consistently outperforming standard tokenizers across various generative models on ImageNet 256x256.
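The corruption the tokenizer is trained against, interpolative noise plus random masking of latent embeddings, is easy to sketch. The mixing rule and mask granularity below are assumptions for illustration; the decoder that reconstructs clean pixels from the corrupted latents is omitted:

```python
import torch

def corrupt_latents(z: torch.Tensor, noise_level: float = 0.7,
                    mask_ratio: float = 0.5) -> torch.Tensor:
    """z: [B, N, C] latent tokens from the tokenizer encoder.
    Interpolates tokens toward Gaussian noise, then randomly masks a subset;
    the tokenizer decoder is trained to reconstruct the clean image from the result."""
    gamma = torch.rand(z.shape[0], 1, 1, device=z.device) * noise_level
    noisy = (1 - gamma) * z + gamma * torch.randn_like(z)      # interpolative noise
    keep = torch.rand(z.shape[:2], device=z.device) > mask_ratio
    return noisy * keep.unsqueeze(-1)                          # masked tokens zeroed out

# toy usage: the training loss would compare decode(corrupt_latents(encode(x))) with x
z = torch.randn(4, 256, 16)
z_corrupted = corrupt_latents(z)
```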

Authors:Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
Abstract:
Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
中文: 提出的Segment Concept (SeC)框架通过从外观匹配转向概念驱动推理,利用大型视觉语言模型构建鲁棒的对象表征,并在新的SeCVOS基准测试中实现了最先进的性能。
English: The proposed Segment Concept (SeC) framework advances video object segmentation by shifting from appearance-based matching to concept-driven reasoning, leveraging large vision-language models to construct robust object representations and achieving state-of-the-art performance on the new SeCVOS benchmark.

Authors:Ian Chuang, Jinyu Zou, Andrew Lee, Dechen Gao, Iman Soltani
Title: Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
Abstract:
Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in foveated image segmentation and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation. Our results show that our method for foveated robot vision drastically reduces computational overhead, and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and should be further considered as a useful inductive bias in robotic vision systems. https://ian-chuang.github.io/gaze-av-aloha/
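A rough sketch of foveated patch tokenization: fine patches are kept only inside a radius around the gaze point, while the periphery is average-pooled into coarser tokens, shrinking the token count a ViT must process. The patch sizes, radius, and pooling choice are all assumptions, and image dimensions are assumed to be multiples of the block size:

```python
import torch
import torch.nn.functional as F

def foveated_tokens(img: torch.Tensor, gaze_xy: tuple,
                    fine: int = 16, coarse: int = 64, radius: float = 96.0):
    """img: [C, H, W]; gaze_xy: pixel coordinates of the gaze point.
    Returns a [N, C*fine*fine] token matrix mixing fine foveal patches with
    pooled peripheral patches (pooled down so all tokens share one dimension)."""
    C, H, W = img.shape
    tokens = []
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            block = img[:, y:y + coarse, x:x + coarse]
            cy, cx = y + coarse / 2, x + coarse / 2
            if (cy - gaze_xy[1]) ** 2 + (cx - gaze_xy[0]) ** 2 <= radius ** 2:
                # Foveal region: split the block into full-resolution fine patches.
                p = block.unfold(1, fine, fine).unfold(2, fine, fine)
                tokens.append(p.reshape(C, -1, fine, fine).permute(1, 0, 2, 3))
            else:
                # Periphery: one coarse token per block, downsampled to the fine size.
                pooled = F.adaptive_avg_pool2d(block.unsqueeze(0), fine).squeeze(0)
                tokens.append(pooled.unsqueeze(0))
    return torch.cat(tokens, dim=0).flatten(1)   # [num_tokens, C*fine*fine]

# toy usage: 256x256 image, gaze near the center
toks = foveated_tokens(torch.rand(3, 256, 256), gaze_xy=(128.0, 128.0))
print(toks.shape)   # far fewer tokens than the 256 uniform 16x16 patches
```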

Authors:Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu
Title: True Multimodal In-Context Learning Needs Attention to the Visual Context
Abstract:
Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
中文:当前多模态大语言模型在上下文学习中难以有效利用视觉信息,导致仅依赖文本模仿,因此本文提出DARA微调方法和TrueMICL数据集来提升真正的多模态学习能力。
English: Current Multimodal Large Language Models struggle to effectively utilize visual information in in-context learning, leading to text-only adaptation, so this paper introduces DARA fine-tuning and TrueMICL dataset to enhance genuine multimodal capabilities.

Authors:Ghassen Baklouti, Julio Silva-Rodríguez, Jose Dolz, Houda Bahig, Ismail Ben Ayed
Title: Regularized Low-Rank Adaptation for Few-Shot Organ Segmentation
Abstract:
Parameter-efficient fine-tuning (PEFT) of pre-trained foundation models is increasingly attracting interest in medical imaging due to its effectiveness and computational efficiency. Among these methods, Low-Rank Adaptation (LoRA) is a notable approach based on the assumption that the adaptation inherently occurs in a low-dimensional subspace. While it has shown good performance, its implementation requires a fixed and unalterable rank, which might be challenging to select given the unique complexities and requirements of each medical imaging downstream task. Inspired by advancements in natural image processing, we introduce a novel approach for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation. Viewing the low-rank representation of the trainable weight matrices as a singular value decomposition, we add an ℓ1 sparsity regularizer to the loss function and tackle it with a proximal optimizer. The regularizer can be viewed as a penalty on the decomposition rank; hence, minimizing it finds task-adapted ranks automatically. Our method is evaluated in a realistic few-shot fine-tuning setting, where we compare it first to standard LoRA and then to several other PEFT methods across two distinct tasks: base organs and novel organs. Our extensive experiments demonstrate the significant performance improvements driven by our method, highlighting its efficiency and robustness against suboptimal rank initialization. Our code is publicly available: https://github.com/ghassenbaklouti/ARENA
中文摘要:本文提出一种新颖的医学图像分割参数高效微调方法,通过动态调整内在秩实现自适应优化,在多项任务中相比标准LoRA及其他方法展现出更优的性能和自动秩选择能力。
English Summary: This paper introduces a novel parameter-efficient fine-tuning method for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation, outperforming standard LoRA and other methods through automated rank selection and improved performance across different tasks.
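The rank-selection mechanism, an ℓ1 penalty on the singular values of the LoRA update handled with a proximal step, amounts to soft-thresholding after each optimizer step. Below is a minimal sketch under that reading; the parametrization, learning rates, and threshold are illustrative, not the released ARENA code:

```python
import torch

class SVDLoRA(torch.nn.Module):
    """LoRA update W + U diag(s) V^T with an explicit singular-value vector s.
    Shrinking entries of s to zero lowers the effective adaptation rank."""
    def __init__(self, w: torch.Tensor, rank: int = 16):
        super().__init__()
        out_dim, in_dim = w.shape
        self.w = w                                   # frozen pretrained weight
        self.U = torch.nn.Parameter(torch.randn(out_dim, rank) * 0.01)
        self.s = torch.nn.Parameter(torch.zeros(rank))
        self.V = torch.nn.Parameter(torch.randn(in_dim, rank) * 0.01)

    def forward(self, x):
        return x @ (self.w + self.U @ torch.diag(self.s) @ self.V.T).T

@torch.no_grad()
def proximal_l1_step(s: torch.Tensor, lam: float, lr: float) -> None:
    """Soft-thresholding: the proximal operator of lam*||s||_1 after a gradient step."""
    s.copy_(torch.sign(s) * torch.clamp(s.abs() - lr * lam, min=0.0))

# toy usage inside a training loop (task loss omitted)
layer = SVDLoRA(torch.randn(64, 32))
opt = torch.optim.SGD([layer.U, layer.s, layer.V], lr=1e-2)
loss = layer(torch.randn(8, 32)).pow(2).mean()
loss.backward()
opt.step()
proximal_l1_step(layer.s.data, lam=1e-3, lr=1e-2)   # zeroed entries = pruned rank
```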

Authors:Feng-Qi Cui, Anyang Tong, Jinyang Huang, Jie Zhang, Dan Guo, Zhi Liu, Meng Wang
Title: Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization
Abstract:
Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
中文: 提出的异构感知分布框架(HDF)通过双插件模块增强时频建模并优化损失平衡,在多个数据集上显著提升了动态面部表情识别的准确性和鲁棒性。
English: The proposed Heterogeneity-aware Distributional Framework (HDF) with dual plug-and-play modules enhances time-frequency modeling and optimizes loss balance, significantly improving dynamic facial expression recognition accuracy and robustness across diverse datasets.
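One way to read the Distribution-aware Scaling Module is as an adaptive weighting of the classification and contrastive losses driven by their gradient magnitudes on shared parameters. The sketch below implements that generic idea (inverse-gradient-norm weighting); the actual DSM formulation also draws on the information-bottleneck principle, which is not reproduced here:

```python
import torch

def balance_losses(loss_cls: torch.Tensor, loss_con: torch.Tensor,
                   shared_params, eps: float = 1e-8) -> torch.Tensor:
    """Weights the two losses inversely to their gradient norms on shared parameters,
    so neither hard-sample-dominated term overwhelms the other."""
    params = list(shared_params)
    g_cls = torch.cat([g.flatten() for g in
                       torch.autograd.grad(loss_cls, params, retain_graph=True)])
    g_con = torch.cat([g.flatten() for g in
                       torch.autograd.grad(loss_con, params, retain_graph=True)])
    n_cls, n_con = g_cls.norm(), g_con.norm()
    w_cls = n_con / (n_cls + n_con + eps)      # larger gradient -> smaller weight
    w_con = n_cls / (n_cls + n_con + eps)
    return w_cls.detach() * loss_cls + w_con.detach() * loss_con

# toy usage
backbone = torch.nn.Linear(10, 8)
feats = backbone(torch.randn(4, 10))
loss_cls = feats.pow(2).mean()
loss_con = (feats[:2] - feats[2:]).abs().mean()
total = balance_losses(loss_cls, loss_con, backbone.parameters())
total.backward()
```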

Authors:Wenqi Ouyang, Zeqi Xiao, Danni Yang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
Title: TokensGen: Harnessing Condensed Tokens for Long Video Generation
Abstract:
Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at https://vicky0522.github.io/tokensgen-webpage/ .

Authors:Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai
Title: Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation
Abstract:
Face image quality assessment (FIQA) is essential for various face-related applications. Although FIQA has been extensively studied and achieved significant progress, the computational complexity of FIQA algorithms remains a key concern for ensuring scalability and practical deployment in real-world systems. In this paper, we aim to develop a computationally efficient FIQA method that can be easily deployed in real-world applications. Specifically, our method consists of two stages: training a powerful teacher model and distilling a lightweight student model from it. To build a strong teacher model, we adopt a self-training strategy to improve its capacity. We first train the teacher model using labeled face images, then use it to generate pseudo-labels for a set of unlabeled images. These pseudo-labeled samples are used in two ways: (1) to distill knowledge into the student model, and (2) to combine with the original labeled images to further enhance the teacher model through self-training. The enhanced teacher model is used to further pseudo-label another set of unlabeled images for distilling the student models. The student model is trained using a combination of labeled images, pseudo-labeled images from the original teacher model, and pseudo-labeled images from the enhanced teacher model. Experimental results demonstrate that our student model achieves comparable performance to the teacher model with an extremely low computational overhead. Moreover, our method achieved first place in the ICCV 2025 VQualA FIQA Challenge. The code is available at https://github.com/sunwei925/Efficient-FIQA.git.
中文: 本文提出一种高效的人脸图像质量评估方法,通过两阶段的师生蒸馏框架,在保持低计算成本的同时实现高性能,并荣获ICCV 2025 VQualA FIQA挑战赛冠军。
English: This paper presents an efficient face image quality assessment method using a two-stage teacher-student distillation framework, achieving high performance with low computational cost and winning the ICCV 2025 VQualA FIQA Challenge.

Authors:Wenjie Huang, Qi Yang, Shuting Xia, He Huang, Zhu Li, Yiling Xu
Title: LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression
Abstract:
Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitation of encoding time and decoder size, current INR based methods only consider lossy geometry compression. In this paper, we propose the first INR based lossless point cloud geometry compression method called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding speed, we design a group of point clouds level coding framework with an effective network initialization strategy, which can reduce around 60% encoding time. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, with the convergence time in the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project can be seen on https://huangwenjie2023.github.io/LINR-PCGC/.
Chinese: 本文提出了LINR-PCGC,这是首个基于隐式神经表示的无损点云几何压缩方法,通过优化编码框架和轻量级网络设计,在显著减少编码时间的同时实现了比现有方法更优的压缩性能。
English: This paper introduces LINR-PCGC, the first lossless point cloud geometry compression method using Implicit Neural Representations, which significantly reduces encoding time and achieves superior compression rates compared to existing methods.

Authors:Zuo-Liang Zhu, Jian Yang, Beibei Wang
Title: Gaussian Splatting with Discretized SDF for Relightable Assets
Abstract:
3D Gaussian splatting (3DGS) has shown its detailed expressive ability and highly efficient rendering speed in the novel view synthesis (NVS) task. The application to inverse rendering still faces several challenges, as the discrete nature of Gaussian primitives makes it difficult to apply geometry constraints. Recent works introduce the signed distance field (SDF) as an extra continuous representation to regularize the geometry defined by Gaussian primitives. It improves the decomposition quality, at the cost of increasing memory usage and complicating training. Unlike these works, we introduce a discretized SDF to represent the continuous SDF in a discrete manner by encoding it within each Gaussian using a sampled value. This approach allows us to link the SDF with the Gaussian opacity through an SDF-to-opacity transformation, enabling rendering the SDF via splatting and avoiding the computational cost of ray marching. The key challenge is to regularize the discrete samples to be consistent with the underlying SDF, as the discrete representation can hardly accommodate gradient-based constraints (e.g., the Eikonal loss). For this, we project Gaussians onto the zero-level set of the SDF and enforce alignment with the surface from splatting, namely a projection-based consistency loss. Thanks to the discretized SDF, our method achieves higher relighting quality, while requiring no extra memory beyond 3DGS and avoiding complex, manually designed optimization. The experiments reveal that our method outperforms existing Gaussian-based inverse rendering methods. Our code is available at https://github.com/NK-CS-ZZL/DiscretizedSDF.
中文: 本文提出了一种离散化的有向距离场(SDF)方法,将其融入3D高斯泼溅技术中,通过投影一致性损失将SDF与高斯不透明度关联,实现了无需额外内存和复杂优化的高质量逆向渲染。
English: This paper introduces a discretized signed distance field (SDF) integrated into 3D Gaussian splatting, enabling high-quality inverse rendering without extra memory or complex optimization by linking SDF to Gaussian opacity through a projection-based consistency loss.
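The SDF-to-opacity link can be sketched with a single bell-shaped mapping: a Gaussian whose stored SDF sample lies near the zero-level set becomes nearly opaque, and one far from the surface becomes transparent. The exact transformation and bandwidth used by the paper are not given here, so the Gaussian-of-SDF form below is an assumption:

```python
import torch

def sdf_to_opacity(sdf: torch.Tensor, bandwidth: float = 0.02) -> torch.Tensor:
    """sdf: per-Gaussian sampled signed distance values.
    Maps small |sdf| (near the surface) to opacity ~1 and large |sdf| to ~0,
    letting the SDF be rendered by ordinary splatting instead of ray marching."""
    return torch.exp(-(sdf / bandwidth) ** 2)

# toy usage
sdf_samples = torch.tensor([0.0, 0.01, 0.05, 0.2])
print(sdf_to_opacity(sdf_samples))   # opacity decreases away from the zero-level set
```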

Authors:Zihui Gao, Jia-Wang Bian, Guosheng Lin, Hao Chen, Chunhua Shen
Title: SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
Abstract:
Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at https://github.com/aim-uofa/SurfaceSplat.
中文: 本研究提出了一种融合SDF全局几何与3DGS细节渲染的混合方法,在基准数据集上实现了卓越的表面重建和新视角合成效果。
English: This study introduces a hybrid method that integrates SDF for global geometry and 3DGS for detailed rendering, achieving superior surface reconstruction and novel view synthesis on benchmark datasets.

Authors:Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu
Title: Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
Abstract:
We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.

Authors:Salah Eddine Bekhouche, Gaby Maroun, Fadi Dornaika, Abdenour Hadid
Title: SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging
Abstract:
Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves the generation quality at reduced inference steps and maintains the flexibility of standard diffusion models. Our method is evaluated on three benchmarking datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is made publicly available at https://github.com/Bekhouche/SegDT.
中文: 本文提出SegDT模型,这是一种基于扩散变换器的皮肤病变分割方法,在保持低成本硬件快速推理的同时,在多个基准数据集上取得了最先进的性能。
English: This paper introduces SegDT, a diffusion transformer-based model for efficient skin lesion segmentation that achieves state-of-the-art results on benchmark datasets while enabling fast inference on low-cost hardware.

Authors:Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe
Title: GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation
Abstract:
Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.
中文摘要:GeMix提出了一种基于类别条件生成对抗网络的两阶段框架,通过生成具有语义连续性的逼真插值图像,在COVID-19检测任务中全面优于传统混合增强方法,显著降低了假阴性率并保持即插即用的特性。
English Summary: GeMix introduces a two-stage framework using class-conditional GANs to generate realistic, label-aware interpolated images for medical image classification, outperforming traditional mixup methods across multiple backbones on COVID-19 detection while reducing false negatives.
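The label-blending step described in the abstract, two Dirichlet-sampled label vectors mixed by a Beta-distributed coefficient to form the soft condition for the generator, is easy to reproduce. Below is a small sketch with illustrative concentration parameters; the StyleGAN2-ADA generator itself is treated as a hypothetical black-box generator(z, label) call:

```python
import numpy as np

def sample_soft_label(num_classes: int, class_a: int, class_b: int,
                      concentration: float = 5.0, beta_ab: float = 2.0,
                      rng=np.random.default_rng()) -> np.ndarray:
    """Draws two label vectors from Dirichlet priors biased toward class_a and class_b,
    then blends them with a Beta-distributed coefficient into one soft label."""
    alpha_a = np.ones(num_classes); alpha_a[class_a] = concentration
    alpha_b = np.ones(num_classes); alpha_b[class_b] = concentration
    y_a, y_b = rng.dirichlet(alpha_a), rng.dirichlet(alpha_b)
    lam = rng.beta(beta_ab, beta_ab)
    return lam * y_a + (1.0 - lam) * y_b          # still a valid probability vector

# toy usage: condition a class-conditional generator on the blended label
soft_label = sample_soft_label(num_classes=3, class_a=0, class_b=2)
# image = generator(z, soft_label)   # hypothetical conditional-GAN call
print(soft_label, soft_label.sum())
```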

Authors:Nicolas Poggi, Shashank Agnihotri, Margret Keuper
Title: Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging
Abstract:
Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. Code: https://github.com/Nicolas-Poggi/Project_THz_Classification/tree/main
中文: 本文提出采用情境学习与视觉语言模型相结合的方法,为太赫兹图像分类提供无需微调的灵活解决方案,在数据稀缺条件下显著提升了分类效果和可解释性。
English: This paper introduces In-Context Learning with Vision-Language Models as a flexible, interpretable method for terahertz image classification, demonstrating improved performance in low-data scenarios without requiring fine-tuning.

Authors:Qinqian Lei, Bo Wang, Robby T. Tan
Title: HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation
Abstract:
Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
Chinese: 本文提出HOLa方法,通过低秩分解视觉语言模型特征来增强对未见类别的泛化能力并改进动作区分,在零样本人-物交互检测中于HICO-DET数据集上取得了最先进的性能。
English: The paper introduces HOLa, a novel method for zero-shot human-object interaction detection that uses low-rank decomposition of vision-language model features to enhance generalization to unseen classes and improve action distinction, achieving state-of-the-art results on HICO-DET.
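The low-rank decomposition of VLM text features into class-shared basis features and per-class adaptable weights can be sketched with a truncated SVD. The rank and the assumption that rows are HOI-class text embeddings are illustrative choices consistent with the abstract; the weight-adaptation, human-object tokens, and LLM-derived regularization are omitted:

```python
import torch

def decompose_text_features(text_feats: torch.Tensor, rank: int = 8):
    """text_feats: [num_hoi_classes, dim] VLM text embeddings of HOI class prompts.
    Returns class-shared basis features [rank, dim] and per-class weights
    [num_hoi_classes, rank] so that weights @ basis approximates text_feats."""
    U, S, Vh = torch.linalg.svd(text_feats, full_matrices=False)
    basis = Vh[:rank]                       # shared across all HOI classes
    weights = U[:, :rank] * S[:rank]        # adaptable, one row per class
    return basis, weights

# toy usage: the weights would then be fine-tuned while the basis stays shared
feats = torch.randn(117, 512)               # e.g. 117 HOI classes, CLIP-sized embeddings
basis, weights = decompose_text_features(feats)
recon_err = (weights @ basis - feats).norm() / feats.norm()
print(basis.shape, weights.shape, recon_err.item())
```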

Authors:Jongmin Shin, Enki Cho, Ka Young Kim, Jung Yong Kim, Seong Tae Kim, Namkee Oh
Title: Towards Holistic Surgical Scene Graph
Abstract:
Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. Previous surgical scene graph studies have demonstrated the feasibility of representing surgical scenes using graphs. However, certain aspects of surgical scenes, such as diverse tool-action-target combinations and the identity of the hand operating the tool, remain underexplored in graph-based representations, despite their importance. To incorporate these aspects into graph representations, we propose the Endoscapes-SG201 dataset, which includes annotations for tool-action-target combinations and hand identity. We also introduce SSG-Com, a graph-based method designed to learn and represent these critical elements. Through experiments on downstream tasks such as critical view of safety assessment and action triplet recognition, we demonstrate the importance of integrating these essential scene graph components, highlighting their significant contribution to surgical scene understanding. The code and dataset are available at https://github.com/ailab-kyunghee/SSG-Com
中文摘要:本研究提出Endoscapes-SG201数据集和SSG-Com方法,通过将工具-动作-目标组合及操作手身份纳入图表示来提升手术场景理解能力,并通过下游任务验证了这些要素的重要价值。
English Summary: This study introduces the Endoscapes-SG201 dataset and SSG-Com method to enhance surgical scene understanding by incorporating tool-action-target combinations and hand identity into graph representations, demonstrating their critical value through downstream task evaluations.

Authors:Simon Winther Albertsen, Hjalte Svaneborg Bjørnstrup, Mostafa Mehdipour Ghazi
Title: RARE-UNet: Resolution-Aligned Routing Entry for Adaptive Medical Image Segmentation
Abstract:
Accurate segmentation is crucial for clinical applications, but existing models often assume fixed, high-resolution inputs and degrade significantly when faced with lower-resolution data in real-world scenarios. To address this limitation, we propose RARE-UNet, a resolution-aware multi-scale segmentation architecture that dynamically adapts its inference path to the spatial resolution of the input. Central to our design are multi-scale blocks integrated at multiple encoder depths, a resolution-aware routing mechanism, and consistency-driven training that aligns multi-resolution features with full-resolution representations. We evaluate RARE-UNet on two benchmark brain imaging tasks for hippocampus and tumor segmentation. Compared to standard UNet, its multi-resolution augmented variant, and nnUNet, our model achieves the highest average Dice scores of 0.84 and 0.65 across resolutions, while maintaining consistent performance and significantly reduced inference time at lower resolutions. These results highlight the effectiveness and scalability of our architecture in achieving resolution-robust segmentation. The code is available at: https://github.com/simonsejse/RARE-UNet.
Chinese: RARE-UNet是一种分辨率自适应的分割架构,能根据输入分辨率动态调整推理路径,相比现有模型在不同分辨率下均获得更高的Dice分数并显著减少推理时间。
English: RARE-UNet is a resolution-aware segmentation architecture that dynamically adapts to input resolutions, achieving superior Dice scores and faster inference across varied resolutions compared to existing models.
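The routing idea, entering the shared encoder at the depth whose feature resolution matches the input, can be shown in a few lines. The entry-depth rule below (log2 of the downscaling factor relative to the training resolution) is a plausible reading of the abstract, not the released architecture:

```python
import math

def routing_entry_depth(input_size: int, full_size: int = 128, num_depths: int = 4) -> int:
    """Chooses which encoder depth a lower-resolution volume should enter:
    a half-resolution input skips the first block, a quarter-resolution input
    skips two, and so on, keeping features aligned with the full-resolution path."""
    factor = full_size / input_size
    depth = int(round(math.log2(factor)))
    return max(0, min(depth, num_depths - 1))

# toy usage
for size in (128, 64, 32, 16):
    print(size, "-> enter encoder at depth", routing_entry_depth(size))
```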

Authors:Hanting Li, Fei Zhou, Xin Sun, Yang Hua, Jungong Han, Liang-Jie Zhang
Title: SAIGFormer: A Spatially-Adaptive Illumination-Guided Network for Low-Light Image Enhancement
Abstract:
Recent Transformer-based low-light enhancement methods have made promising progress in recovering global illumination. However, they still struggle with non-uniform lighting scenarios, such as backlit and shadowed regions, which appear as over-exposure or inadequate brightness restoration. To address this challenge, we present a Spatially-Adaptive Illumination-Guided Transformer (SAIGFormer) framework that enables accurate illumination restoration. Specifically, we propose a dynamic integral image representation to model the spatially-varying illumination, and further construct a novel Spatially-Adaptive Integral Illumination Estimator (SAI²E). Moreover, we introduce an Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism, which leverages the estimated illumination to calibrate lightness-relevant features toward visually pleasing illumination enhancement. Extensive experiments on five standard low-light datasets and a cross-domain benchmark (LOL-Blur) demonstrate that our SAIGFormer significantly outperforms state-of-the-art methods in both quantitative and qualitative metrics. In particular, our method achieves superior performance in non-uniform illumination enhancement while exhibiting strong generalization capabilities across multiple datasets. Code is available at https://github.com/LHTcode/SAIGFormer.git.
中文: 提出的SAIGFormer框架通过空间自适应照明估计器和照明引导注意力机制,有效解决了低光增强中的非均匀照明问题,在多个数据集上实现了最先进的性能表现。
English: The proposed SAIGFormer framework effectively addresses non-uniform lighting challenges in low-light enhancement by employing a spatially-adaptive illumination estimator and an illumination-guided attention mechanism, achieving state-of-the-art performance across multiple datasets.
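The integral-image idea behind the illumination estimator can be sketched in a few lines: a summed-area table gives the mean of any k×k window in constant time, producing a spatially varying illumination map. The plain box mean and fixed window below are simplifying assumptions; the paper's $\text{SAI}^2\text{E}$ learns a dynamic variant.

```python
import torch
import torch.nn.functional as F

def local_mean_illumination(img, k=31):
    """img: (B,1,H,W) luminance in [0,1]; returns the k x k box mean per pixel."""
    pad = k // 2
    # pad: half window plus one leading zero row/col for exclusive prefix sums
    x = F.pad(img, (pad + 1, pad, pad + 1, pad))
    sat = x.cumsum(dim=-1).cumsum(dim=-2)          # integral image (summed-area table)
    # box sum from four corner lookups of the summed-area table
    s = (sat[..., k:, k:] - sat[..., :-k, k:]
         - sat[..., k:, :-k] + sat[..., :-k, :-k])
    # borders are slightly dimmed by the zero padding; acceptable for a sketch
    return s / (k * k)

img = torch.rand(1, 1, 64, 64)
print(local_mean_illumination(img).shape)  # torch.Size([1, 1, 64, 64])
```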

Authors:Deyu Zhang, Tingting Long, Jinrui Zhang, Ligeng Chen, Ju Ren, Yaoxue Zhang
Title: Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval
Abstract:
Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 on the MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
Chinese: ProCLIP框架通过提示感知的帧采样策略和两阶段候选剪枝方法,在保持检索精度的同时将计算延迟显著降低75.3%,有效解决了文本-视频检索中效率与精度的平衡难题。
English: ProCLIP introduces a prompt-aware frame sampling strategy and a two-stage candidate pruning approach to significantly reduce computational latency by 75.3% while maintaining competitive retrieval accuracy in text-video retrieval systems.
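A minimal sketch of the selection logic: score each frame's lightweight feature against the embedded text prompt and keep the top-k most relevant frames in temporal order. The random features stand in for ProCLIP's lightweight extractors; only the prompt-aware sampling step is illustrated.

```python
import torch
import torch.nn.functional as F

def select_frames(frame_feats, text_feat, k=8):
    """frame_feats: (T, D) lightweight per-frame features; text_feat: (D,) prompt."""
    sims = F.cosine_similarity(frame_feats, text_feat.unsqueeze(0), dim=-1)  # (T,)
    keep = sims.topk(k).indices.sort().values  # top-k, restored to temporal order
    return keep, sims

frame_feats = torch.randn(64, 256)  # 64 candidate frames (stand-in features)
text_feat = torch.randn(256)        # embedded user prompt (stand-in)
idx, scores = select_frames(frame_feats, text_feat)
print(idx)  # indices of the 8 most prompt-relevant frames
```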

Authors:Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen
Title: One Last Attention for Your Vision-Language Model
Abstract:
Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representations from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective \textbf{R}ational \textbf{Ada}ptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating, or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably with the current state of the art in most settings. Code is available at \href{https://github.com/khufia/RAda/tree/main}{github.com/khufia/RAda}.
中文: 提出的RAda方法通过可学习的掩码动态校准融合的跨模态表征,以最小计算成本显著提升视觉语言模型在不同适配场景下的微调性能。
English: The proposed RAda method enhances vision-language model fine-tuning by dynamically calibrating the fused cross-modal representation through a learned mask, achieving versatile performance improvements across different adaptation settings with minimal computational cost.
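A minimal sketch of the core idea, under the reading that the rational matrix is the element-wise product that sums into CLIP's logit: a small attention layer predicts a mask that reweights each element before the sum. The single MultiheadAttention layer here is an illustrative stand-in for RAda's lightweight head, not the released implementation.

```python
import torch
import torch.nn as nn

class RationalMask(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, img, txt):
        # img: (B, D) image features; txt: (C, D) class text features
        rational = img.unsqueeze(1) * txt.unsqueeze(0)   # (B, C, D) rational matrix
        mask, _ = self.attn(rational, rational, rational)
        mask = torch.sigmoid(mask)                       # per-element weights in (0,1)
        return (mask * rational).sum(-1)                 # calibrated logits, (B, C)

B, C, D = 2, 10, 64
logits = RationalMask(D)(torch.randn(B, D), torch.randn(C, D))
print(logits.shape)  # torch.Size([2, 10])
```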

Authors:Ruijie Zhu, Mulin Yu, Linning Xu, Lihan Jiang, Yixuan Li, Tianzhu Zhang, Jiangmiao Pang, Bo Dai
Title: ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
Abstract:
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_page
中文: ObjectGS提出了一种对象感知框架,通过将语义理解融入3D高斯泼溅技术,在分割任务和场景编辑等应用中实现了更优的性能。
English: ObjectGS introduces an object-aware framework that enhances 3D Gaussian Splatting by incorporating semantic understanding, enabling superior performance in segmentation tasks and practical applications like scene editing.

Authors:Ka Young Kim, Hyeon Bae Kim, Seong Tae Kim
Title: SurgX: Neuron-Concept Association for Explainable Surgical Phase Recognition
Abstract:
Surgical phase recognition plays a crucial role in surgical workflow analysis, enabling various applications such as surgical monitoring, skill assessment, and workflow optimization. Despite significant advancements in deep learning-based surgical phase recognition, these models remain inherently opaque, making it difficult to understand how they make decisions. This lack of interpretability hinders trust and makes it challenging to debug the model. To address this challenge, we propose SurgX, a novel concept-based explanation framework that enhances the interpretability of surgical phase recognition models by associating neurons with relevant concepts. In this paper, we introduce the process of selecting representative example sequences for neurons, constructing a concept set tailored to the surgical video dataset, associating neurons with concepts and identifying neurons crucial for predictions. Through extensive experiments on two surgical phase recognition models, we validate our method and analyze the explanation for prediction. This highlights the potential of our method in explaining surgical phase recognition. The code is available at https://github.com/ailab-kyunghee/SurgX
中文: 手术阶段识别对分析手术流程至关重要,但现有深度学习模型缺乏可解释性,因此作者提出SurgX这一基于概念的解释框架,通过将神经元与相关概念关联来增强模型透明度,并通过实验验证了其有效性。
English: Surgical phase recognition is essential for analyzing surgical workflows but current deep learning models lack interpretability, so the authors propose SurgX, a concept-based explanation framework that enhances model transparency by linking neurons to relevant concepts and validating its effectiveness through experiments.

Authors:Huiyu Zhai, Xingxing Yang, Yalan Ye, Chenyang Li, Bin Fan, Changze Li
Title: Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond
Abstract:
Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as a dense semantic prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as a sparse geometric prior to mitigate intrinsic noise in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark features and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model's ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.
中文: ORSANet通过引入多模态语义引导、多尺度交互模块和动态对抗排斥损失,有效解决了面部表情识别中的遮挡和数据集偏差问题,在公开基准和新建的Occlu-FER数据集上均实现了最优识别性能。
English: ORSANet addresses facial expression recognition challenges from occlusion and dataset biases by integrating multi-modal semantic guidance, a multi-scale fusion module, and a dynamic loss function, achieving state-of-the-art performance on both public benchmarks and the newly introduced Occlu-FER dataset.

Authors:Etai Sella, Noam Atia, Ron Mokady, Hadar Averbuch-Elor
Title: Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
Abstract:
Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape's identity. Furthermore, to encourage identity preservation also within the local edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling a fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and also adherence to the textual description.
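The inference-time coordinate blending can be sketched as a RePaint-style update: after each reverse step, coordinates outside the edit mask are overwritten with a re-noised copy of the original shape, preserving identity without inversion. The denoiser, noise schedule, and mask below are hypothetical placeholders.

```python
import torch

def blended_step(x_t, orig_pts, mask, t, denoise_step, alpha_bar):
    """x_t: (N,3) noisy points at level t; mask: (N,) bool, True = editable region."""
    x_prev = denoise_step(x_t, t)                         # one reverse step: t -> t-1
    a = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)  # target level; clean at t=1
    renoised = a.sqrt() * orig_pts + (1 - a).sqrt() * torch.randn_like(orig_pts)
    # keep the edit inside the mask, re-noised original coordinates outside
    return torch.where(mask.unsqueeze(-1), x_prev, renoised)

T = 50
alpha_bar = torch.linspace(0.9999, 0.02, T)  # cumulative schedule for t = 1..T
orig = torch.rand(2048, 3)                   # original shape (placeholder)
mask = torch.rand(2048) < 0.3                # region selected for editing
x = torch.randn(2048, 3)
for t in range(T, 0, -1):                    # identity "denoiser" placeholder
    x = blended_step(x, orig, mask, t, lambda xt, t: xt, alpha_bar)
```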

Authors:Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi
Title: To Label or Not to Label: PALM -- A Predictive Model for Evaluating Sample Efficiency in Active Learning Models
Abstract:
Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the basis for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications. The code is available at: https://github.com/juliamachnio/PALM.
中文: PALM提出了一种统一的数学模型,通过四个关键参数描述主动学习轨迹,能在不同数据集和预算下实现性能预测与策略比较。
English: PALM introduces a unified mathematical model to characterize active learning trajectories through four key parameters, enabling performance prediction and strategy comparison across diverse datasets and budgets.
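A sketch of the predictive use: fit a low-parameter accuracy-versus-budget curve on early observations and extrapolate the rest of the trajectory. The four-parameter saturating form and the toy observations below are assumptions for illustration; PALM's exact parameterization is defined in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def al_curve(n, a_max, k, b, p):
    # a_max: achievable accuracy; k: coverage efficiency;
    # b: early-stage accuracy offset; p: scalability exponent
    return a_max - (a_max - b) * np.exp(-k * n**p)

budgets = np.array([100, 200, 400, 800, 1600], dtype=float)  # labels acquired so far
acc = np.array([0.42, 0.55, 0.66, 0.73, 0.77])               # toy partial trajectory
params, _ = curve_fit(al_curve, budgets, acc,
                      p0=[0.85, 0.01, 0.3, 0.5], maxfev=20000)
print(dict(zip(["a_max", "k", "b", "p"], params.round(3))))
print("predicted accuracy at 10k labels:", al_curve(10000, *params).round(3))
```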

Authors:Zhenyu Li, Haotong Lin, Jiashi Feng, Peter Wonka, Bingyi Kang
Title: BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?
Abstract:
Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide an in-depth analysis of key findings and observations. We hope our work sparks further discussion in the community on best practices for depth model evaluation and paves the way for future research and advancements in depth estimation.

Authors:Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou
Title: Minutiae-Anchored Local Dense Representation for Fingerprint Matching
Abstract:
Fingerprint matching under diverse capture conditions remains a fundamental challenge in biometric recognition. To achieve robust and accurate performance in such scenarios, we propose DMD, a minutiae-anchored local dense representation which captures both fine-grained ridge textures and discriminative minutiae features in a spatially structured manner. Specifically, descriptors are extracted from local patches centered and oriented on each detected minutia, forming a three-dimensional tensor, where two dimensions represent spatial locations on the fingerprint plane and the third encodes semantic features. This representation explicitly captures abstract features of local image patches, enabling a multi-level, fine-grained description that aggregates information from multiple minutiae and their surrounding ridge structures. Furthermore, thanks to its strong spatial correspondence with the patch image, DMD allows for the use of foreground segmentation masks to identify valid descriptor regions. During matching, comparisons are then restricted to overlapping foreground areas, improving efficiency and robustness. Extensive experiments on rolled, plain, partial, contactless, and latent fingerprint datasets demonstrate the effectiveness and generalizability of the proposed method. It achieves state-of-the-art accuracy across multiple benchmarks while maintaining high computational efficiency, showing strong potential for large-scale fingerprint recognition. Corresponding code is available at https://github.com/Yu-Yy/DMD.
中文: 提出的DMD方法采用基于细节点的密集表示,通过三维张量结构化地捕捉脊线纹理和细节点特征,并利用前景约束匹配在多种指纹数据集上实现了最优的准确率和效率。
English: The proposed DMD method introduces a minutiae-anchored dense representation that captures both ridge textures and minutiae features in a structured 3D tensor, achieving state-of-the-art accuracy and efficiency across diverse fingerprint datasets through foreground-constrained matching.
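The foreground-restricted comparison can be sketched directly: cosine similarity between two dense descriptors is averaged only over spatial cells where both foreground masks are valid. Shapes and the averaging rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_descriptor_score(d1, d2, m1, m2, eps=1e-8):
    """d1, d2: (H, W, C) dense descriptors; m1, m2: (H, W) bool foreground masks."""
    overlap = m1 & m2                               # compare only where both are valid
    if overlap.sum() == 0:
        return torch.tensor(0.0)
    sim = F.cosine_similarity(d1, d2, dim=-1)       # (H, W) per-cell similarity
    return (sim * overlap).sum() / (overlap.sum() + eps)

d1, d2 = torch.randn(16, 16, 64), torch.randn(16, 16, 64)
m1 = torch.rand(16, 16) > 0.3
m2 = torch.rand(16, 16) > 0.3
print(masked_descriptor_score(d1, d2, m1, m2))
```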

Authors:An Wang, Rulin Zhou, Mengya Xu, Yiru Ye, Longfei Gou, Yiting Chang, Hao Chen, Chwee Ming Lim, Jiankun Wang, Hongliang Ren
Title: EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control
Abstract:
Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios: motion-based softening excels with complex tissue deformations, while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
Chinese: 本文提出EndoControlMag框架,通过自适应掩码调节放大和周期性参考帧重置技术,在无需训练的情况下增强内窥镜手术中细微血管运动的可视化效果,在多种复杂手术场景下均展现出卓越的精度和鲁棒性。
English: This paper introduces EndoControlMag, a training-free framework for magnifying subtle vascular motions in endoscopic surgery through adaptive mask-conditioned magnification and periodic reference resetting, demonstrating superior accuracy and robustness across diverse surgical scenarios.

Authors:Yanbing Zhang, Zhe Wang, Qin Zhou, Mengping Yang
Title: FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
Abstract:
In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT's capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject's layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT's dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT's zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: https://github.com/Monalissaa/FreeCus.
中文摘要:FreeCus是一种无需训练的框架,通过注意力共享机制、动态偏移分析和多模态大语言模型集成,激活扩散变换器的零样本能力实现真实的主体驱动图像生成,在不需额外训练的情况下达到领先水平。
English Summary: FreeCus is a training-free framework that activates diffusion transformers' zero-shot capabilities for authentic subject-driven image synthesis through attention sharing, dynamic shifting analysis, and MLLM integration, achieving state-of-the-art results without requiring additional training.

Authors:Naeem Paeedeh, Mahardhika Pratama, Wolfgang Mayer, Jimmy Cao, Ryszard Kowlczyk
Title: Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation
Abstract:
Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier still outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at https://github.com/Naeem-Paeedeh/CPLSR.
中文: 本研究提出融合投影和结合自监督变换的伪类生成方法,以解决跨域少样本学习中的过拟合问题,在BSCD-FSL基准测试中展现出卓越性能。
English: The study introduces Coalescent Projection and a pseudo-class generation method with Self-Supervised Transformations to overcome overfitting in Cross-Domain Few-Shot Learning, demonstrating superior performance on the BSCD-FSL benchmark.

Authors:Krishna Kanth Nakka
Title: Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse Autoencoders
Abstract:
Interpretability is critical in high-stakes domains such as medical imaging, where understanding model decisions is essential for clinical adoption. In this work, we introduce Sparse Autoencoder (SAE)-based interpretability to breast imaging by analyzing Mammo-CLIP, a vision--language foundation model pretrained on large-scale mammogram image--report pairs. We train a patch-level \texttt{Mammo-SAE} on Mammo-CLIP to identify and probe latent features associated with clinically relevant breast concepts such as \textit{mass} and \textit{suspicious calcification}. Our findings reveal that the top-activated class-level latent neurons in the SAE latent space often align with ground-truth regions, and also uncover several confounding factors influencing the model's decision-making process. Additionally, we analyze which latent neurons the model relies on during downstream finetuning to improve breast concept prediction. This study highlights the promise of interpretable SAE latent representations in providing deeper insight into the internal workings of foundation models at every layer for breast imaging. The code will be released at https://krishnakanthnakka.github.io/MammoSAE/

Authors:Siqi Chen, Guoqing Zhang, Jiahao Lai, Bingzhi Shen, Sihong Zhang, Caixia Dong, Xuejin Chen, Yang Li
Title: Hierarchical Part-based Generative Model for Realistic 3D Blood Vessel
Abstract:
Advancements in 3D vision have increased the impact of blood vessel modeling on medical applications. However, accurately representing the complex geometry and topology of blood vessels remains a challenge due to their intricate branching patterns, curvatures, and irregular shapes. In this study, we propose a hierarchical part-based framework for 3D vessel generation that separates the global binary tree-like topology from local geometric details. Our approach proceeds in three stages: (1) key graph generation to model the overall hierarchical structure, (2) vessel segment generation conditioned on geometric properties, and (3) hierarchical vessel assembly by integrating the local segments according to the global key graph. We validate our framework on real-world datasets, demonstrating superior performance over existing methods in modeling complex vascular networks. This work marks the first successful application of a part-based generative approach for 3D vessel modeling, setting a new benchmark for vascular data generation. The code is available at: https://github.com/CybercatChen/PartVessel.git.
中文: 本研究提出了一种分层部件式框架用于三维血管生成,将全局拓扑与局部几何细节分离,在真实数据集上验证了其优越性能,为血管建模设立了新标杆。
English: This study introduces a hierarchical part-based framework for 3D blood vessel generation that separates global topology from local geometry, achieving superior performance on real-world datasets and setting a new benchmark in vascular modeling.

Authors:Mohammad-Maher Nakshbandi, Ziad Sharawy, Sorin Grigorescu
Title: LoopNet: A Multitasking Few-Shot Learning Approach for Loop Closure in Large Scale SLAM
Abstract:
One of the main challenges in the Simultaneous Localization and Mapping (SLAM) loop closure problem is the recognition of previously visited places. In this work, we tackle the two main problems of real-time SLAM systems: 1) loop closure detection accuracy and 2) real-time computation constraints on the embedded hardware. Our LoopNet method is based on a multitasking variant of the classical ResNet architecture, adapted for online retraining on a dynamic visual dataset and optimized for embedded devices. The online retraining is designed using a few-shot learning approach. The architecture provides both an index into the queried visual dataset, and a measurement of the prediction quality. Moreover, by leveraging DISK (DIStinctive Keypoints) descriptors, LoopNet surpasses the limitations of handcrafted features and traditional deep learning methods, offering better performance under varying conditions. Code is available at https://github.com/RovisLab/LoopNet. Additionally, we introduce a new loop closure benchmarking dataset, coined LoopDB, which is available at https://github.com/RovisLab/LoopDB.
中文: LoopNet通过多任务ResNet架构结合少量样本在线学习,利用DISK描述符提升SLAM闭环检测的实时性和准确性,有效应对不同环境变化。
English: LoopNet addresses SLAM loop closure challenges by using a multitasking ResNet for real-time detection and few-shot online retraining, enhanced with DISK descriptors for robust performance under varying conditions.

Authors:Ran Zhang, Xuanhua He, Li Xueheng, Ke Cao, Liu Liu, Wenbo Xu, Fang Jiabin, Yang Qize, Jie Zhang
Title: Rethinking Pan-sharpening: Principled Design, Unified Training, and a Universal Loss Surpass Brute-Force Scaling
Abstract:
The field of pan-sharpening has recently seen a trend towards increasingly large and complex models, often trained on single, specific satellite datasets. This approach, however, leads to high computational overhead and poor generalization on full resolution data, a paradigm we challenge in this paper. In response to this issue, we propose PanTiny, a lightweight, single-step pan-sharpening framework designed for both efficiency and robust performance. More critically, we introduce a multiple-in-one training paradigm, where a single, compact model is trained simultaneously on three distinct satellite datasets (WV2, WV3, and GF2) with different resolutions and spectral information. Our experiments show that this unified training strategy not only simplifies deployment but also significantly boosts generalization on full-resolution data. Further, we introduce a universally powerful composite loss function that elevates the performance of almost all pan-sharpening models, pushing state-of-the-art metrics into a new era. Our PanTiny model, benefiting from these innovations, achieves a superior performance-to-efficiency balance, outperforming most larger, specialized models. Through extensive ablation studies, we validate that principled engineering in model design, training paradigms, and loss functions can surpass brute-force scaling. Our work advocates for a community-wide shift towards creating efficient, generalizable, and data-conscious models for pan-sharpening. The code is available at https://github.com/Zirconium233/PanTiny.
中文: 本文提出PanTiny轻量级全色锐化框架,通过多数据集联合训练和复合损失函数,在实现卓越性能与泛化能力的同时,突破了当前模型庞大且专一化的研究范式。
English: This paper introduces PanTiny, a lightweight and efficient pan-sharpening framework trained on multiple satellite datasets with a novel composite loss function, achieving superior performance and generalization while challenging the trend of large, specialized models.

Authors:Zhaotong Yang, Yuhui Li, Shengfeng He, Xinzhe Li, Yangyang Xu, Junyu Dong, Yong Du
Title: OmniVTON: Training-Free Universal Virtual Try-On
Abstract:
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by a continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene. Code is available at https://github.com/Jerome-Young/OmniVTON
中文摘要:OmniVTON是首个无需训练的统一虚拟试穿框架,通过解耦服装与姿态条件实现在多样化场景下的精细纹理保持与精准姿态对齐,首次支持多人场景的逼真服装转换。
English Summary: OmniVTON is a training-free universal virtual try-on framework that decouples garment and pose conditioning to achieve high texture fidelity and pose consistency across diverse scenarios, including multi-human applications.

Authors:Lyes Saad Saoud, Irfan Hussain
Title: EBA-AI: Ethics-Guided Bias-Aware AI for Efficient Underwater Image Enhancement and Coral Reef Monitoring
Abstract:
Underwater image enhancement is vital for marine conservation, particularly coral reef monitoring. However, AI-based enhancement models often face dataset bias, high computational costs, and lack of transparency, leading to potential misinterpretations. This paper introduces EBA-AI, an ethics-guided bias-aware AI framework to address these challenges. EBA-AI leverages CLIP embeddings to detect and mitigate dataset bias, ensuring balanced representation across varied underwater environments. It also integrates adaptive processing to optimize energy efficiency, significantly reducing GPU usage while maintaining competitive enhancement quality. Experiments on LSUI400, Oceanex, and UIEB100 show that while PSNR drops by a controlled 1.0 dB, computational savings enable real-time feasibility for large-scale marine monitoring. Additionally, uncertainty estimation and explainability techniques enhance trust in AI-driven environmental decisions. Comparisons with CycleGAN, FunIEGAN, RAUNENet, WaterNet, UGAN, PUGAN, and UTUIE validate EBA-AI's effectiveness in balancing efficiency, fairness, and interpretability in underwater image processing. By addressing key limitations of AI-driven enhancement, this work contributes to sustainable, bias-aware, and computationally efficient marine conservation efforts. For interactive visualizations, animations, source code, and access to the preprint, visit: https://lyessaadsaoud.github.io/EBA-AI/

Authors:Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, Ziwei Liu
Title: Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
Abstract:
Human intelligence requires correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT), to assess if video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives, and evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.
中文: 视频思维测试(Video-TT)旨在评估视频大语言模型在理解复杂视觉叙事和应对对抗性问题时的表现,结果显示其与人类能力存在显著差距。
English: The Video Thinking Test (Video-TT) is introduced to evaluate video large language models' ability to interpret complex visual narratives and maintain robustness against adversarial questions, revealing a significant performance gap compared to humans.

Authors:Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang
Title: Hierarchical Cross-modal Prompt Learning for Vision-Language Models
Abstract:
Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL's superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.
中文: HiCroPL提出了一种分层跨模态提示学习框架,通过建立文本与视觉模态间的双向知识流动实现语义相互增强,在多项基准测试中取得了最优性能。
English: HiCroPL introduces a hierarchical cross-modal prompt learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling mutual semantic refinement and achieving state-of-the-art performance across multiple benchmarks.

Authors:Saeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans Vandierendonck
Title: Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices
Abstract:
Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
Chinese: Polymorph是一种上下文感知框架,通过动态低秩适配器实现高效的实时多标签视频分类,在嵌入式设备上能耗降低40%,平均精度提升9个百分点。
English: Polymorph is a context-aware framework that uses dynamic Low Rank Adapters for efficient real-time multi-label video classification, reducing energy consumption by 40% and improving mAP by 9 points on embedded devices.
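A minimal sketch of the runtime behavior: greedily pick the fewest adapters whose label subsets cover the currently active labels, then add their low-rank deltas onto the shared weight instead of switching full models. The greedy set cover, names, and shapes are illustrative assumptions.

```python
import torch

def pick_adapters(active, coverage):
    """Greedy set cover; coverage maps adapter name -> set of labels it handles."""
    chosen, remaining = [], set(active)
    while remaining:
        best = max(coverage, key=lambda a: len(coverage[a] & remaining))
        if not coverage[best] & remaining:
            break  # some active labels are uncovered by any adapter
        chosen.append(best)
        remaining -= coverage[best]
    return chosen

def compose(W, adapters, chosen, scale=1.0):
    """adapters: name -> (A, B) with A: (r, in), B: (out, r); returns W + sum(B@A)."""
    delta = sum(adapters[a][1] @ adapters[a][0] for a in chosen)
    return W + scale * delta

d_out, d_in, r = 32, 64, 4
W = torch.randn(d_out, d_in)                      # shared backbone weight
adapters = {n: (torch.randn(r, d_in), torch.randn(d_out, r)) for n in "xyz"}
coverage = {"x": {0, 1}, "y": {2, 3}, "z": {1, 2}}
chosen = pick_adapters({1, 2}, coverage)          # -> ['z'] covers both labels
print(chosen, compose(W, adapters, chosen).shape)
```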

Authors:Hai Huang, Yan Xia, Shulei Wang, Hanting Wang, Minghui Fang, Shengpeng Ji, Sashuai Zhou, Tao Jin, Zhou Zhao
Title: Open-set Cross Modal Generalization via Multimodal Unified Representation
Abstract:
This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model's ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. The code is available at https://github.com/haihuangcode/CMG.
中文摘要:本文提出开放集跨模态泛化任务OSCMG,并设计包含FCMI和CUJP模块的MICU方法,通过增强多模态对齐与特征多样性来提升开放集环境下的泛化能力。
English Summary: This paper introduces the Open-set Cross Modal Generalization (OSCMG) task and proposes MICU with FCMI and CUJP components to enhance multimodal alignment and feature diversity for robust generalization in open-set environments.
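A minimal sketch in the spirit of FCMI: one InfoNCE term aligns clip-level (holistic) features across modalities and another aligns frame-level (temporal) features, with random masking applied before the temporal term. Temperature, mask rate, and the mean-pooled holistic features are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                 # (N, N); diagonal entries are positives
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

B, T, D = 8, 12, 128
audio = torch.randn(B, T, D)                 # modality 1, per-frame features
video = torch.randn(B, T, D)                 # modality 2, per-frame features

loss_holistic = info_nce(audio.mean(1), video.mean(1))        # semantic level
mask = (torch.rand(B, T, 1) > 0.25).float()                   # random masking
loss_temporal = info_nce((audio * mask).flatten(0, 1),
                         (video * mask).flatten(0, 1))        # temporal level
print((loss_holistic + loss_temporal).item())
```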

Authors:Xiaojie Li, Chu Li, Shi-Zhe Chen, Xi Chen
Title: U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs
Abstract:
Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore various aspects of the details in embedding generation and training strategies, including progressive transition, hard negative mining and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (\textbf{U}niversal \textbf{M}ultimod\textbf{A}l \textbf{R}etrie\textbf{V}al via \textbf{E}mbedding \textbf{L}earning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exhibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
中文: 本研究提出U-MARVEL统一框架,通过分析多模态大模型嵌入学习的关键要素,在监督设定下显著超越现有方法,并在零样本任务中展现出强大的泛化能力。
English: This study introduces U-MARVEL, a unified framework that identifies key factors for effective multimodal retrieval using MLLMs and achieves superior performance on benchmarks through optimized embedding strategies.

Authors:Xingshu Chen, Sicheng Yu, Chong Cheng, Hao Wang, Ting Tian
Title: An Uncertainty-aware DETR Enhancement Framework for Object Detection
Abstract:
This paper investigates the problem of object detection with a focus on both improving the localization accuracy of bounding boxes and explicitly modeling prediction uncertainty. Conventional detectors rely on deterministic bounding box regression, ignoring uncertainty in predictions and limiting model robustness. In this paper, we propose an uncertainty-aware enhancement framework for DETR-based object detectors. We model bounding boxes as multivariate Gaussian distributions and incorporate the Gromov-Wasserstein distance into the loss function to better align the predicted and ground-truth distributions. Building on this, we derive a Bayes Risk formulation to filter high-risk information and improve detection reliability. We also propose a simple algorithm to quantify localization uncertainty via confidence intervals. Experiments on the COCO benchmark show that our method can be effectively integrated into existing DETR variants, enhancing their performance. We further extend our framework to leukocyte detection tasks, achieving state-of-the-art results on the LISC and WBCDD datasets. These results confirm the scalability of our framework across both general and domain-specific detection tasks. Code page: https://github.com/ParadiseforAndaChen/An-Uncertainty-aware-DETR-Enhancement-Framework-for-Object-Detection.
中文: 本文提出了一种面向DETR检测器的不确定性增强框架,通过将边界框建模为高斯分布并引入Gromov-Wasserstein距离来提升定位精度和不确定性建模能力,在通用与特定领域检测任务中均取得了最优性能。
English: This paper introduces an uncertainty-aware enhancement framework for DETR-based object detectors, modeling bounding boxes as Gaussian distributions and incorporating Gromov-Wasserstein distance to improve localization accuracy and uncertainty modeling, achieving state-of-the-art results on both general and domain-specific detection tasks.
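A minimal sketch of the Gaussian-box view: each box becomes a diagonal Gaussian over (cx, cy, w, h), compared to ground truth with a distributional distance, and per-coordinate confidence intervals serve as a localization-uncertainty proxy. For simplicity, the closed-form 2-Wasserstein distance between diagonal Gaussians is used here as a stand-in for the paper's Gromov-Wasserstein formulation.

```python
import torch

def w2_diag_gaussian(mu_p, sigma_p, mu_g, sigma_g):
    """Squared 2-Wasserstein distance between diagonal Gaussians.
    mu: (N, 4) box params (cx, cy, w, h); sigma: (N, 4) standard deviations."""
    return ((mu_p - mu_g) ** 2).sum(-1) + ((sigma_p - sigma_g) ** 2).sum(-1)

def box_confidence_interval(mu, sigma, z=1.96):
    """95% interval per box coordinate as a simple localization-uncertainty proxy."""
    return mu - z * sigma, mu + z * sigma

mu_pred = torch.tensor([[0.50, 0.50, 0.20, 0.30]])
sigma_pred = torch.tensor([[0.02, 0.02, 0.03, 0.03]])   # predicted uncertainties
mu_gt = torch.tensor([[0.52, 0.49, 0.22, 0.28]])
loss = w2_diag_gaussian(mu_pred, sigma_pred, mu_gt, torch.zeros_like(sigma_pred))
lo, hi = box_confidence_interval(mu_pred, sigma_pred)
print(loss, lo, hi)
```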

Authors:Xiang Tang, Ruotong Li, Xiaopeng Fan
Title: Towards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization
Abstract:
In recent years, 3D generation has made great strides in both academia and industry. However, generating 3D scenes from a single RGB image remains a significant challenge, as current approaches often struggle to ensure both object generation quality and scene coherence in multi-object scenarios. To overcome these limitations, we propose a novel three-stage framework for 3D scene generation with explicit geometric representations and high-quality textural details via single image-guided model generation and spatial layout optimization. Our method begins with an image instance segmentation and inpainting phase, which recovers missing details of occluded objects in the input images, thereby achieving complete generation of foreground 3D assets. Subsequently, our approach captures the spatial geometry of the reference image by constructing a pseudo-stereo viewpoint for camera parameter estimation and scene depth inference, while employing a model selection strategy to ensure optimal alignment between the 3D assets generated in the previous step and the input. Finally, through model parameterization and minimization of the Chamfer distance between point clouds in 3D and 2D space, our approach optimizes layout parameters to produce an explicit 3D scene representation that maintains precise alignment with the input guidance image. Extensive experiments on multi-object scene image sets have demonstrated that our approach not only outperforms state-of-the-art methods in terms of geometric accuracy and texture fidelity of individual generated 3D models, but also has significant advantages in scene layout synthesis.

Authors:Beier Zhu, Ruoyu Wang, Tong Zhao, Hanwang Zhang, Chi Zhang
Title: Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models
Abstract:
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. In addition, our method can serve as a plugin to improve existing ODE samplers. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of our EPD in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin. Code is available at https://github.com/BeierZhu/EPD.
中文: 提出的集成并行方向求解器通过并行梯度评估减少截断误差,在保持低延迟采样的同时显著提升图像生成质量,可作为插件优化现有求解器。
English: The proposed Ensemble Parallel Direction solver enhances diffusion model efficiency by using parallel gradient evaluations to reduce truncation errors while maintaining low-latency sampling through full parallelization and minimal training overhead.
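A minimal sketch of an ensemble-parallel step: several drift evaluations at intermediate offsets along the base direction are combined with learnable weights into a single update; since the evaluations are mutually independent, they can run in parallel. The toy ODE, offsets, and Simpson-like weights are placeholders; EPD distills these parameters per step.

```python
import torch

def epd_step(x, t, dt, f, taus, weights):
    """taus, weights: (K,) tensors (learnable in EPD); f(x, t) is the ODE drift.
    The K extra evaluations are independent, so they can run in parallel."""
    base = f(x, t)
    grads = [f(x + tau * dt * base, t + tau * dt) for tau in taus]  # parallelizable
    d = sum(w * g for w, g in zip(weights, grads))                  # weighted ensemble
    return x + dt * d

f = lambda x, t: -x                            # toy drift: dx/dt = -x
taus = torch.tensor([0.0, 0.5, 1.0])           # intermediate offsets (placeholder)
weights = torch.tensor([1 / 6, 4 / 6, 1 / 6])  # Simpson-like weights (placeholder)
x, t, dt = torch.tensor(1.0), 0.0, 0.5
for _ in range(4):
    x = epd_step(x, t, dt, f, taus, weights)
    t += dt
print(x)  # compare with the exact solution e^{-2} ≈ 0.135 for the toy ODE
```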

Authors:Joseph Raj Vishal, Divesh Basina, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, Bharatesh Chakravarthi
Title: InterAct-Video: Reasoning-Rich Video QA for Urban Traffic
Abstract:
Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces \textbf{InterAct VideoQA}, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA
中文: 本文提出了InterAct VideoQA数据集,包含真实交通录像和超过2.5万个问答对,专门用于解决现有视频问答模型在处理复杂时空交通场景时的局限性,以推动智能交通系统的发展。
English: This paper introduces the InterAct VideoQA dataset, a specialized benchmark with real-world traffic footage and over 25,000 QA pairs, designed to address the limitations of existing VideoQA models in handling complex spatiotemporal traffic scenarios and to advance intelligent transportation systems.

Authors:Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang
Title: Docopilot: Improving Multimodal Models for Document-Level Understanding
Abstract:
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot
Chinese: 本文提出了高质量数据集Doc-750K以解决文档级多模态数据不足的问题,并开发了原生多模态模型Docopilot,该模型无需依赖检索增强生成就能在文档理解任务中实现更优性能。
English: This paper introduces Doc-750K, a high-quality dataset addressing the lack of document-level multimodal data, and presents Docopilot, a native multimodal model that outperforms existing methods in document understanding without relying on retrieval-augmented generation.

Authors:Jifeng Shen, Haibo Zhan, Shaohua Dong, Xin Zuo, Wankou Yang, Haibin Ling
Title: Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection
Abstract:
Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity presents a critical bottleneck for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.
中文:MS2Fusion框架通过状态空间模型中的双路径参数交互机制,解决了多光谱特征融合中的关键限制,在多个基准测试的对象检测及其他多光谱任务中实现了卓越性能。
English: The proposed MS2Fusion framework addresses limitations in multispectral feature fusion by employing a dual-path parametric interaction mechanism within a state space model, achieving superior performance in object detection and other multispectral tasks across multiple benchmarks.

Authors:Guoping Xu, Christopher Kabat, You Zhang
Title: Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2
Abstract:
Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality-specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real-time video segmentation, present new opportunities for prompt-based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large-scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD-SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction with minimal parameter overhead. This design enables effective fine-tuning of SAM2 on medical videos with limited training data. Unlike existing adapter-based methods focused solely on static images, DD-SAM2 fully exploits SAM2's streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet-Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at https://github.com/apple1986/DD-SAM2.
中文: DD-SAM2框架通过引入深度可分离膨胀适配器,有效优化了Segment Anything Model 2在医学视频分割中的应用,在专业数据集上以最小计算成本实现了卓越性能。
English: The DD-SAM2 framework efficiently adapts the Segment Anything Model 2 for medical video segmentation by incorporating a Depthwise-Dilated Adapter, achieving superior performance with minimal computational overhead on specialized datasets.
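The adapter itself is easy to sketch: parallel depthwise 3×3 convolutions at several dilation rates capture multi-scale context with very few parameters, followed by a pointwise projection and a residual connection back onto the frozen features. Dilation rates, the activation, and the module name are assumptions.

```python
import torch
import torch.nn as nn

class DDAdapter(nn.Module):
    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim)  # depthwise
            for d in dilations
        )
        self.proj = nn.Conv2d(dim * len(dilations), dim, 1)            # pointwise
        self.act = nn.GELU()

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # multi-scale context
        return x + self.proj(self.act(multi))  # residual keeps backbone features intact

x = torch.randn(1, 256, 32, 32)
print(DDAdapter(256)(x).shape)  # torch.Size([1, 256, 32, 32])
```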

Authors:Andrea Moschetto, Lemuel Puglisi, Alec Sargood, Pierluigi Dell'Acqua, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì
Title: Benchmarking GANs, Diffusion Models, and Flow Matching for T1w-to-T2w MRI Translation
Abstract:
Magnetic Resonance Imaging (MRI) enables the acquisition of multiple image contrasts, such as T1-weighted (T1w) and T2-weighted (T2w) scans, each offering distinct diagnostic insights. However, acquiring all desired modalities increases scan time and cost, motivating research into computational methods for cross-modal synthesis. To address this, recent approaches aim to synthesize missing MRI contrasts from those already acquired, reducing acquisition time while preserving diagnostic quality. Image-to-image (I2I) translation provides a promising framework for this task. In this paper, we present a comprehensive benchmark of generative models, specifically Generative Adversarial Networks (GANs), diffusion models, and flow matching (FM) techniques, for T1w-to-T2w 2D MRI I2I translation. All frameworks are implemented with comparable settings and evaluated on three publicly available MRI datasets of healthy adults. Our quantitative and qualitative analyses show that the GAN-based Pix2Pix model outperforms diffusion and FM-based methods in terms of structural fidelity, image quality, and computational efficiency. Consistent with existing literature, these results suggest that flow-based models are prone to overfitting on small datasets and simpler tasks, and may require more data to match or surpass GAN performance. These findings offer practical guidance for deploying I2I translation techniques in real-world MRI workflows and highlight promising directions for future research in cross-modal medical image synthesis. Code and models are publicly available at https://github.com/AndreaMoschetto/medical-I2I-benchmark.
中文: 本文对跨模态磁共振图像合成的生成模型进行基准测试,发现基于GAN的Pix2Pix在图像质量和效率上优于扩散模型与流匹配方法,同时揭示了流模型存在数据依赖性问题。
English: This paper benchmarks generative models for cross-modal MRI synthesis, finding that GAN-based Pix2Pix outperforms diffusion and flow matching methods in image quality and efficiency while highlighting flow models' data dependency issues.

Authors:Sujata Gaihre, Amir Thapa Magar, Prasuna Pokharel, Laxmi Tiwari
Title: Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025
Abstract:
This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the KASVIR dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git
中文: 本文采用Florence多模态基础模型构建胃肠内镜视觉问答系统,通过领域特定数据增强和微调在MEDVQA 2025挑战中取得优异表现,为医疗VQA应用提供了重要基准。
English: This paper presents a Florence-based VQA system for gastrointestinal endoscopy that uses domain-specific augmentation and fine-tuning to achieve strong performance on the MEDVQA 2025 challenge, demonstrating the potential of large multimodal models in medical applications.

Authors:Jiahao Ma, Tianyu Wang, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen
Title: DCHM: Depth-Consistent Human Modeling for Multiview Detection
Abstract:
Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the \href{https://jiahao-ma.github.io/DCHM/}{project page}.

Authors:Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, Yanjun Huang
Title: GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert and a Scene-Adaptive Experts Group, equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves both adaptability and robustness across diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. The code is available at https://github.com/newbrains1/GEMINUS.
中文: 本文提出GEMINUS框架,通过双感知路由器将全局专家与场景自适应专家相结合,实现了自动驾驶在多样化场景中的卓越适应性和鲁棒性,并在基准测试中达到最优性能。
English: This paper introduces GEMINUS, a Mixture-of-Experts framework that enhances autonomous driving by combining a robust Global Expert with scene-specific experts through a Dual-aware Router, achieving superior adaptability and state-of-the-art performance on benchmarks.
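A rough sketch of how a dual-aware router might couple a global expert with scene-adaptive experts is given below; the entropy-based uncertainty measure and the linear experts are assumptions made for exposition, not the GEMINUS implementation.

```python
# Illustrative sketch of a dual-aware MoE router that mixes a global expert
# with scene-adaptive experts; the gating form and uncertainty handling are
# assumptions, not the released GEMINUS code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAwareMoE(nn.Module):
    def __init__(self, feat_dim: int, num_scene_experts: int, out_dim: int):
        super().__init__()
        self.global_expert = nn.Linear(feat_dim, out_dim)
        self.scene_experts = nn.ModuleList(
            nn.Linear(feat_dim, out_dim) for _ in range(num_scene_experts))
        self.router = nn.Linear(feat_dim, num_scene_experts)

    def forward(self, scene_feat: torch.Tensor):
        logits = self.router(scene_feat)                 # (B, E) scenario-level routing
        probs = F.softmax(logits, dim=-1)
        # Routing uncertainty: normalized entropy of the gate. High entropy ->
        # fall back to the robust global expert; low entropy -> trust the
        # scene-adaptive expert mixture.
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1, keepdim=True)
        uncertainty = entropy / torch.log(torch.tensor(float(probs.size(-1))))
        scene_out = sum(p.unsqueeze(-1) * e(scene_feat)
                        for p, e in zip(probs.unbind(-1), self.scene_experts))
        return uncertainty * self.global_expert(scene_feat) + (1 - uncertainty) * scene_out

if __name__ == "__main__":
    moe = DualAwareMoE(feat_dim=128, num_scene_experts=4, out_dim=6)
    print(moe(torch.randn(2, 128)).shape)  # torch.Size([2, 6])
```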

Authors:Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei
Title: GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration
Abstract:
The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/gwk429/GPI-Net.
中文摘要:本文提出GPI-Net,通过格式塔原则指导的并行交互网络,利用正交几何一致性优化局部与全局特征融合,在点云配准任务中展现出优于现有方法的性能。
English Summary: This paper introduces GPI-Net, a Gestalt-guided network that enhances point cloud registration by integrating local and global features through orthogonal geometric consistency and attention mechanisms, demonstrating superior performance in various tasks.

Authors:Praneeth Namburi, Roger Pallarès-López, Jessica Rosendorf, Duarte Folgado, Brian W. Anthony
Title: DUSTrack: Semi-automated point tracking in ultrasound videos
Abstract:
Ultrasound technology enables safe, non-invasive imaging of dynamic tissue behavior, making it a valuable tool in medicine, biomechanics, and sports science. However, accurately tracking tissue motion in B-mode ultrasound remains challenging due to speckle noise, low edge contrast, and out-of-plane movement. These challenges complicate the task of tracking anatomical landmarks over time, which is essential for quantifying tissue dynamics in many clinical and research applications. This manuscript introduces DUSTrack (Deep learning and optical flow-based toolkit for UltraSound Tracking), a semi-automated framework for tracking arbitrary points in B-mode ultrasound videos. We combine deep learning with optical flow to deliver high-quality and robust tracking across diverse anatomical structures and motion patterns. The toolkit includes a graphical user interface that streamlines the generation of high-quality training data and supports iterative model refinement. It also implements a novel optical-flow-based filtering technique that reduces high-frequency frame-to-frame noise while preserving rapid tissue motion. DUSTrack demonstrates superior accuracy compared to contemporary zero-shot point trackers and performs on par with specialized methods, establishing its potential as a general and foundational tool for clinical and biomechanical research. We demonstrate DUSTrack's versatility through three use cases: cardiac wall motion tracking in echocardiograms, muscle deformation analysis during reaching tasks, and fascicle tracking during ankle plantarflexion. As an open-source solution, DUSTrack offers a powerful, flexible framework for point tracking to quantify tissue motion from ultrasound videos. DUSTrack is available at https://github.com/praneethnamburi/DUSTrack.
中文: DUSTrack是一种结合深度学习与光流的半自动化工具包,能精准追踪B超视频中的任意点位,有效克服散斑噪声等难题,在临床和生物力学应用中性能优于现有方法。
English: DUSTrack is a semi-automated toolkit combining deep learning and optical flow to accurately track arbitrary points in B-mode ultrasound videos, overcoming challenges like speckle noise and outperforming existing methods in clinical and biomechanical applications.
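The optical-flow-based filtering idea can be illustrated with a toy example: flow-propagated positions carry the low-frequency motion, and only a fraction of the per-frame network correction is kept, suppressing frame-to-frame jitter. The blending rule and weight below are illustrative assumptions, not DUSTrack's exact filter.

```python
# Toy sketch of filtering frame-to-frame tracking noise with optical flow:
# flow-propagated positions anchor the smooth motion, and the per-frame
# network prediction only contributes a damped correction.
import numpy as np

def flow_filtered_track(net_points: np.ndarray, flows: np.ndarray,
                        alpha: float = 0.3) -> np.ndarray:
    """net_points: (T, 2) per-frame predictions; flows: (T-1, 2) flow at the point."""
    filtered = np.zeros_like(net_points)
    filtered[0] = net_points[0]
    for t in range(1, len(net_points)):
        propagated = filtered[t - 1] + flows[t - 1]      # smooth motion prior
        # Keep only a fraction of the (often noisy) frame-wise correction.
        filtered[t] = propagated + alpha * (net_points[t] - propagated)
    return filtered

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = np.cumsum(rng.normal(0, 0.5, size=(50, 2)), axis=0)
    noisy = truth + rng.normal(0, 2.0, size=truth.shape)   # jittery network output
    flows = np.diff(truth, axis=0)                          # idealized optical flow
    smooth = flow_filtered_track(noisy, flows)
    print("filtered error:", np.abs(smooth - truth).mean(),
          "raw error:", np.abs(noisy - truth).mean())
```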

Authors:Qiyu Xu, Zhanxuan Hu, Yu Duan, Ercheng Pei, Yonghang Tai
Title: A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
Abstract:
Generalized Category Discovery (GCD) aims to classify unlabeled data from both known and unknown categories by leveraging knowledge from labeled known categories. While existing methods have made notable progress, they often overlook a hidden stumbling block in GCD: distracted attention. Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model's focus by pruning non-informative tokens. AF consists of two simple yet effective components: Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP), working in a cascade. TIME quantifies token importance across multiple scales, while TAP prunes non-informative tokens by utilizing the multi-scale importance scores provided by TIME. AF is a lightweight, plug-and-play module that integrates seamlessly into existing GCD methods with minimal computational overhead. When incorporated into one prominent GCD method, SimGCD, AF achieves up to 15.4% performance improvement over the baseline with minimal computational overhead. The implementation code is provided in https://github.com/Afleve/AFGCD.
中文: 摘要提出注意力聚焦(AF)机制,通过剪除非信息性标记来增强广义类别发现,有效聚焦关键对象并减少背景干扰,以极低计算成本显著提升模型性能。
English: The abstract introduces Attention Focusing (AF), an adaptive mechanism that enhances Generalized Category Discovery by pruning non-informative tokens to sharpen model focus and reduce background distraction, achieving significant performance improvements with minimal computational cost.
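A minimal sketch of attention-focusing-style token pruning is shown below: patch tokens are scored by the attention they receive from the [CLS] token, averaged over heads and layers as a stand-in for multi-scale importance, and the least informative tokens are dropped. The scoring rule and keep ratio are illustrative, not the exact TIME/TAP formulation.

```python
# Illustrative token pruning: score patch tokens by CLS attention and keep
# only the most informative ones (assumed scoring rule, not the paper's code).
import torch

def prune_tokens(tokens: torch.Tensor, attn_maps: list[torch.Tensor],
                 keep_ratio: float = 0.7) -> torch.Tensor:
    """tokens: (B, 1+N, D) with CLS first; attn_maps: list of (B, H, 1+N, 1+N)."""
    # Importance of each patch token = attention it receives from CLS,
    # averaged over heads and over the provided layers ("scales").
    cls_attn = torch.stack([a[:, :, 0, 1:].mean(dim=1) for a in attn_maps]).mean(0)  # (B, N)
    n_keep = max(1, int(cls_attn.size(1) * keep_ratio))
    keep_idx = cls_attn.topk(n_keep, dim=1).indices                  # (B, n_keep)
    patches = tokens[:, 1:, :]
    kept = torch.gather(patches, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return torch.cat([tokens[:, :1, :], kept], dim=1)                # CLS + kept patches

if __name__ == "__main__":
    B, N, D, H = 2, 196, 384, 6
    toks = torch.randn(B, 1 + N, D)
    attns = [torch.softmax(torch.randn(B, H, 1 + N, 1 + N), dim=-1) for _ in range(3)]
    print(prune_tokens(toks, attns).shape)  # torch.Size([2, 138, 384])
```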

Authors:Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Alexander Jacobson, Lu Yuan, Leonid Sigal
Title: In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding
Abstract:
Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: First, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types. Second, they lack targeted pre-training for chart-data alignment, which hampers the model's understanding of underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding. Experimental results demonstrate that ChartScope significantly enhances comprehension on a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.

Authors:Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su
Title: WebGuard: Building a Generalizable Guardrail for Web Agents
Abstract:
The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
中文: 大型语言模型驱动的自主网络代理快速发展带来了意外有害行为的高风险,为此开发的WebGuard数据集通过评估行动风险和训练防护模型,揭示了现有模型的安全缺陷,尽管微调后性能显著提升,但仍需近乎完美的保障措施才能满足高风险部署要求。
English: The rapid advancement of LLM-powered autonomous web agents introduces significant risks of unintended harmful actions, prompting the development of WebGuard, a comprehensive dataset for assessing action risks and training guardrail models, which reveals current models' safety deficiencies and the need for near-perfect safeguards despite performance improvements from fine-tuning.

Authors:Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
Title: Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Abstract:
We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
中文摘要:Franca是首个完全开源的视觉基础模型,通过透明训练流程及多头部聚类、位置解耦等创新技术,在性能上超越主流专有模型,为高性能、可复现的人工智能设立了新标杆。
English Summary: Franca is the first fully open-source vision foundation model that outperforms leading proprietary models through a transparent training pipeline and innovative techniques like multi-head clustering and positional disentanglement, setting a new standard for high-performance, reproducible AI.
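The nested Matryoshka clustering projector can be sketched as a set of heads that each see only a prefix of the embedding and assign it to their own codebook, giving coarse-to-fine clusters without enlarging the backbone; the dimensions and prototype counts below are illustrative assumptions, not Franca's released configuration.

```python
# Sketch of a nested (Matryoshka-style) multi-head clustering projector.
import torch
import torch.nn as nn

class NestedClusteringProjector(nn.Module):
    def __init__(self, dim: int = 768, prefix_dims=(192, 384, 768),
                 num_prototypes=(1024, 4096, 16384)):
        super().__init__()
        assert len(prefix_dims) == len(num_prototypes)
        self.prefix_dims = prefix_dims
        # One linear "prototype" layer per nesting level.
        self.heads = nn.ModuleList(
            nn.Linear(d, k, bias=False) for d, k in zip(prefix_dims, num_prototypes))

    def forward(self, feats: torch.Tensor) -> list[torch.Tensor]:
        # feats: (B, dim) backbone embeddings; returns one logit tensor per level.
        feats = nn.functional.normalize(feats, dim=-1)
        return [head(feats[:, :d]) for d, head in zip(self.prefix_dims, self.heads)]

if __name__ == "__main__":
    proj = NestedClusteringProjector()
    outs = proj(torch.randn(4, 768))
    print([tuple(o.shape) for o in outs])  # [(4, 1024), (4, 4096), (4, 16384)]
```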

Authors:Shravan Venkatraman, Pavan Kumar S, Rakesh Raj Madavan, Chandrakala S
Title: UGPL: Uncertainty-Guided Progressive Learning for Evidence-Based Classification in Computed Tomography
Abstract:
Accurate classification of computed tomography (CT) images is essential for diagnosis and treatment planning, but existing methods often struggle with the subtle and spatially diverse nature of pathological features. Current approaches typically process images uniformly, limiting their ability to detect localized abnormalities that require focused analysis. We introduce UGPL, an uncertainty-guided progressive learning framework that performs a global-to-local analysis by first identifying regions of diagnostic ambiguity and then conducting detailed examination of these critical areas. Our approach employs evidential deep learning to quantify predictive uncertainty, guiding the extraction of informative patches through a non-maximum suppression mechanism that maintains spatial diversity. This progressive refinement strategy, combined with an adaptive fusion mechanism, enables UGPL to integrate both contextual information and fine-grained details. Experiments across three CT datasets demonstrate that UGPL consistently outperforms state-of-the-art methods, achieving improvements of 3.29%, 2.46%, and 8.08% in accuracy for kidney abnormality, lung cancer, and COVID-19 detection, respectively. Our analysis shows that the uncertainty-guided component provides substantial benefits, with performance dramatically increasing when the full progressive learning pipeline is implemented. Our code is available at: https://github.com/shravan-18/UGPL
Chinese: UGPL提出了一种不确定性引导的渐进学习框架,通过识别诊断模糊区域并进行精细局部分析来提升CT图像分类效果,在多个医疗数据集上实现了显著准确率提升。
English: UGPL introduces an uncertainty-guided progressive learning framework that enhances CT image classification by identifying ambiguous regions and conducting detailed local analysis, achieving significant accuracy improvements across multiple medical datasets.
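The evidential-uncertainty step follows standard evidential deep learning and can be sketched as below: network outputs are treated as Dirichlet evidence, and the vacuity u = K / S serves as the ambiguity score that drives patch selection; the exact head used in UGPL may differ.

```python
# Sketch of evidential uncertainty: softplus logits -> Dirichlet evidence,
# vacuity u = K / sum(alpha) as the per-pixel ambiguity score.
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits: torch.Tensor):
    """logits: (B, K, H, W) per-pixel class scores -> (probs, uncertainty)."""
    evidence = F.softplus(logits)              # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)  # total evidence S
    probs = alpha / strength                   # expected class probabilities
    uncertainty = logits.size(1) / strength    # u = K / S, bounded by 1
    return probs, uncertainty.squeeze(1)

if __name__ == "__main__":
    logits = torch.randn(2, 3, 32, 32)
    probs, u = evidential_uncertainty(logits)
    print(probs.shape, u.shape, float(u.max()) <= 1.0)  # (2,3,32,32) (2,32,32) True
```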

Authors:Haoran Li, Yuli Tian, Kun Lan, Yong Liao, Lin Wang, Pan Hui, Peng Yuan Zhou
Title: DreamScene: 3D Gaussian-based End-to-end Text-to-3D Scene Generation
Abstract:
Generating 3D scenes from natural language holds great promise for applications in gaming, film, and design. However, existing methods struggle with automation, 3D consistency, and fine-grained control. We present DreamScene, an end-to-end framework for high-quality and editable 3D scene generation from text or dialogue. DreamScene begins with a scene planning module, where a GPT-4 agent infers object semantics and spatial constraints to construct a hybrid graph. A graph-based placement algorithm then produces a structured, collision-free layout. Based on this layout, Formation Pattern Sampling (FPS) generates object geometry using multi-timestep sampling and reconstructive optimization, enabling fast and realistic synthesis. To ensure global consistency, DreamScene employs a progressive camera sampling strategy tailored to both indoor and outdoor settings. Finally, the system supports fine-grained scene editing, including object movement, appearance changes, and 4D dynamic motion. Experiments demonstrate that DreamScene surpasses prior methods in quality, consistency, and flexibility, offering a practical solution for open-domain 3D content creation. Code and demos are available at https://jahnsonblack.github.io/DreamScene-Full/.

Authors:Pablo Marcos-Manchón, Lluís Fuentemilla
Title: Convergent transformations of visual representation in brains and models
Abstract:
A fundamental question in cognitive neuroscience is what shapes visual perception: the external world's structure or the brain's internal architecture. Although some perceptual variability can be traced to individual differences, brain responses to naturalistic stimuli evoke similar activity patterns across individuals, suggesting a convergent representational principle. Here, we test if this stimulus-driven convergence follows a common trajectory across people and deep neural networks (DNNs) during its transformation from sensory to high-level internal representations. We introduce a unified framework that traces representational flow by combining inter-subject similarity with alignment to model hierarchies. Applying this framework to three independent fMRI datasets of visual scene perception, we reveal a cortex-wide network, conserved across individuals, organized into two pathways: a medial-ventral stream for scene structure and a lateral-dorsal stream tuned for social and biological content. This functional organization is captured by the hierarchies of vision DNNs but not language models, reinforcing the specificity of the visual-to-semantic transformation. These findings show a convergent computational solution for visual encoding in both human and artificial vision, driven by the structure of the external world.
中文摘要:该研究揭示了人类大脑中存在一个保守的全皮层网络,分为处理场景结构的腹内侧通路和感知社会内容的背外侧通路,这一功能结构与深度神经网络的层级处理相吻合,表明外部世界结构驱动了视觉编码的趋同计算方案。
English Summary: This study reveals a conserved cortex-wide network in humans, organized into two visual pathways for scene structure and social content, which aligns with the hierarchical processing in deep neural networks, demonstrating a convergent computational solution for visual encoding driven by external world structure.
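A toy sketch of this analysis style, assuming Spearman-correlated representational dissimilarity matrices (RDMs) as the similarity measure, is given below; the shapes and synthetic data are purely illustrative.

```python
# Toy sketch: per-subject and per-layer RDMs, inter-subject similarity, and
# brain-model alignment via Spearman correlation (illustrative assumptions).
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def rdm(responses: np.ndarray) -> np.ndarray:
    """responses: (n_stimuli, n_features) -> condensed dissimilarity vector."""
    return pdist(responses, metric="correlation")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_stim = 40
    shared = rng.normal(size=(n_stim, 50))                    # stimulus-driven signal
    subjects = [shared + rng.normal(scale=0.5, size=shared.shape) for _ in range(3)]
    dnn_layer = shared @ rng.normal(size=(50, 128))            # a stand-in "model layer"

    subj_rdms = [rdm(s) for s in subjects]
    inter_subject = np.mean([spearmanr(subj_rdms[i], subj_rdms[j])[0]
                             for i in range(3) for j in range(i + 1, 3)])
    brain_model = np.mean([spearmanr(r, rdm(dnn_layer))[0] for r in subj_rdms])
    print(f"inter-subject RDM similarity: {inter_subject:.2f}")
    print(f"brain-model alignment:        {brain_model:.2f}")
```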

Authors:Lei Xu, Torkel B Brismar
Title: Software architecture and manual for novel versatile CT image analysis toolbox -- AnatomyArchive
Abstract:
We have developed a novel CT image analysis package named AnatomyArchive, built on top of the recent full body segmentation model TotalSegmentator. It provides automatic target volume selection and deselection capabilities according to user-configured anatomies for volumetric upper- and lower-bounds. It has a knowledge-graph-based and time-efficient tool for anatomy segmentation mask management and medical image database maintenance. AnatomyArchive enables automatic body volume cropping, as well as automatic arm-detection and exclusion, for more precise body composition analysis in both 2D and 3D formats. It provides robust voxel-based radiomic feature extraction, feature visualization, and an integrated toolchain for statistical tests and analysis. A python-based GPU-accelerated nearly photo-realistic segmentation-integrated composite cinematic rendering is also included. We present its software architecture design, illustrate its workflow and the working principles of its algorithms, and provide a few examples of how the software can be used to assist the development of modern machine learning models. Open-source code will be released at https://github.com/lxu-medai/AnatomyArchive for research and educational purposes only.
中文摘要:AnatomyArchive是基于TotalSegmentator开发的新型CT图像分析工具包,提供自动体积选择、身体成分分析、影像组学特征提取和电影级渲染功能,旨在辅助现代机器学习模型的开发。
English Summary: AnatomyArchive is a novel CT image analysis package built on TotalSegmentator, offering automated volume selection, body composition analysis, radiomic feature extraction, and cinematic rendering to support modern machine learning model development.

Authors:Huu-Phu Do, Yu-Wei Chen, Yi-Cheng Liao, Chi-Wei Hsiao, Han-Yang Wang, Wei-Chen Chiu, Ching-Chun Huang
Title: DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance
Abstract:
Blind Face Restoration aims to recover high-fidelity, detail-rich facial images from unknown degraded inputs, presenting significant challenges in preserving both identity and detail. Pre-trained diffusion models have been increasingly used as image priors to generate fine details. Still, existing methods often use fixed diffusion sampling timesteps and a global guidance scale, assuming uniform degradation. This limitation and potentially imperfect degradation kernel estimation frequently lead to under- or over-diffusion, resulting in an imbalance between fidelity and quality. We propose DynFaceRestore, a novel blind face restoration approach that learns to map any blindly degraded input to Gaussian blurry images. By leveraging these blurry images and their respective Gaussian kernels, we dynamically select the starting timesteps for each blurry image and apply closed-form guidance during the diffusion sampling process to maintain fidelity. Additionally, we introduce a dynamic guidance scaling adjuster that modulates the guidance strength across local regions, enhancing detail generation in complex areas while preserving structural fidelity in contours. This strategy effectively balances the trade-off between fidelity and quality. DynFaceRestore achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating robustness and effectiveness in blind face restoration. Project page at https://nycu-acm.github.io/DynFaceRestore/

Authors:Tongtong Su, Chengyu Wang, Bingyan Liu, Jun Huang, Dongming Lu
Title: Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis
Abstract:
In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also leads to a significant improvement of 1.6x-4.5x speedup in inference time. Source codes: https://github.com/Tonniia/EVS.
中文: 本文提出EVS,一种无需训练的方法,通过结合文本到图像和文本到视频模型来提升视频画质和运动流畅度,同时大幅加快推理速度。
English: The paper introduces EVS, a training-free method that combines text-to-image and text-to-video models to enhance video quality and motion smoothness while significantly speeding up inference.

Authors:Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo
Title: When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework
Abstract:
Researchers have recently proposed using event cameras for person re-identification (ReID) due to their promising performance and the better balance they offer in terms of privacy protection, and event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event streams, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID and MARS datasets fully validate the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
中文摘要:本文针对事件相机行人重识别数据稀缺问题,提出了大规模RGB-事件数据集EvReID,并开发了TriPro-ReID对比学习框架,有效融合RGB帧、事件流和行人属性以提升特征学习能力。
English Summary: This paper introduces EvReID, a large-scale RGB-event person re-identification dataset addressing data scarcity issues, and proposes TriPro-ReID, a contrastive learning framework that effectively integrates RGB frames, event streams, and pedestrian attributes to enhance feature learning.

Authors:Seungjun Moon, Sangjoon Yu, Gyeong-Moon Park
Title: EPSilon: Efficient Point Sampling for Lightening of Hybrid-based 3D Avatar Generation
Abstract:
The rapid advancement of neural radiance fields (NeRF) has paved the way to generate animatable human avatars from a monocular video. However, the sole usage of NeRF suffers from a lack of details, which has led to hybrid representations that utilize an SMPL-based mesh together with NeRF. While hybrid-based models show photo-realistic human avatar generation qualities, they suffer from extremely slow inference due to their deformation scheme: to be aligned with the mesh, hybrid-based models use a deformation based on SMPL skinning weights, which incurs high computational cost at each sampled point. We observe that since most of the sampled points are located in empty space, they do not affect the generation quality but add inference latency through deformation. In light of this observation, we propose EPSilon, a hybrid-based 3D avatar generation scheme with novel efficient point sampling strategies that boost both training and inference. In EPSilon, we propose two methods to omit empty points at rendering: empty ray omission (ERO) and empty interval omission (EIO). In ERO, we wipe out rays that progress through the empty space. Then, EIO narrows down the sampling interval on the ray, which wipes out the region not occupied by either clothes or mesh. The delicate sampling scheme of EPSilon enables not only a large reduction in computational cost during deformation but also the designation of the important regions to be sampled, which enables a single-stage NeRF structure without hierarchical sampling. Compared to existing methods, EPSilon maintains the generation quality while using only 3.9% of sampled points and achieves around 20 times faster inference, together with 4 times faster training convergence. We provide video results on https://github.com/seungjun-moon/epsilon.
Chinese Summary: EPSilon通过创新的高效点采样策略ERO和EIO,在混合三维化身生成中剔除空域采样,仅用3.9%采样点就实现20倍推理加速和4倍训练收敛提速,同时保持生成质量。
English Summary: EPSilon introduces efficient point sampling strategies, ERO and EIO, to eliminate empty space sampling in hybrid 3D avatar generation, achieving 20x faster inference and 4x quicker training while maintaining quality with only 3.9% of sampled points.
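The two omission steps can be illustrated with a toy sketch: rays whose samples never come close to the body are discarded (ERO), and surviving rays are restricted to the occupied span (EIO); the distance threshold and tensor shapes are assumptions for exposition, not the released EPSilon code.

```python
# Toy sketch of empty ray omission (ERO) and empty interval omission (EIO).
import torch

def omit_empty(samples: torch.Tensor, dist_to_mesh: torch.Tensor, tau: float = 0.05):
    """samples: (R, S, 3) points per ray; dist_to_mesh: (R, S) distances to the body."""
    occupied = dist_to_mesh < tau                       # (R, S) near-surface samples
    keep_ray = occupied.any(dim=1)                      # ERO: ray intersects the body
    samples, occupied = samples[keep_ray], occupied[keep_ray]
    intervals = []
    for ray_pts, occ in zip(samples, occupied):         # EIO: clamp to occupied span
        idx = occ.nonzero(as_tuple=True)[0]
        lo, hi = int(idx.min()), int(idx.max())
        intervals.append(ray_pts[lo:hi + 1])
    return keep_ray, intervals

if __name__ == "__main__":
    pts = torch.rand(4, 64, 3)
    dist = torch.rand(4, 64)
    dist[0] += 1.0                                      # ray 0 never hits the body
    keep, spans = omit_empty(pts, dist)
    print(keep.tolist(), [s.shape[0] for s in spans])
```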

Authors:Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov
Title: $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention
Abstract:
Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no degradation in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
中文: 本文提出NABLA,一种邻域自适应块级注意力机制,在视频扩散变换器中通过自适应稀疏阈值降低计算复杂度,同时保持生成质量,实现高达2.7倍的训练和推理加速且性能损失极小。
English: The paper introduces NABLA, a neighborhood adaptive block-level attention mechanism that reduces computational complexity in video diffusion transformers while maintaining generative quality, achieving up to 2.7x faster training and inference with minimal performance loss.
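A rough sketch of the block-level sparsity idea follows: attention is estimated at block granularity from mean-pooled queries and keys, and only the key blocks covering most of the attention mass are kept as a boolean block mask, which could then drive a sparse kernel such as FlexAttention. The pooling and cumulative-mass threshold are illustrative choices, not the NABLA algorithm verbatim.

```python
# Sketch: block-level attention estimate + adaptive threshold -> block mask.
import torch
import torch.nn.functional as F

def adaptive_block_mask(q: torch.Tensor, k: torch.Tensor,
                        block: int = 64, keep_mass: float = 0.9) -> torch.Tensor:
    """q, k: (B, H, L, D) with L divisible by block -> (B, H, L//block, L//block) bool."""
    B, H, L, D = q.shape
    nb = L // block
    q_blk = q.view(B, H, nb, block, D).mean(dim=3)        # pooled block queries
    k_blk = k.view(B, H, nb, block, D).mean(dim=3)        # pooled block keys
    scores = F.softmax(q_blk @ k_blk.transpose(-1, -2) / D ** 0.5, dim=-1)
    # Keep the smallest set of key blocks covering `keep_mass` of the attention.
    sorted_scores, order = scores.sort(dim=-1, descending=True)
    cum = sorted_scores.cumsum(dim=-1)
    keep_sorted = cum - sorted_scores < keep_mass         # always keeps the top block
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter(-1, order, keep_sorted)
    return mask

if __name__ == "__main__":
    q = torch.randn(1, 2, 4096, 32)
    k = torch.randn(1, 2, 4096, 32)
    m = adaptive_block_mask(q, k)
    # With real, peaked attention maps the retained fraction is far smaller.
    print(m.shape, f"kept fraction of key blocks: {m.float().mean():.2f}")
```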

Authors:Seyyed Saeid Cheshmi, Buyao Lyu, Thomas Lisko, Rajesh Rajamani, Robert A. McGovern, Yogatheesan Varatharajah
Title: Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning
Abstract:
Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson's disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.
中文: 本研究提出了一种利用未标记IMU-视频数据进行跨模态自监督预训练的方法,旨在提高人体活动识别的泛化能力,特别是在帕金森病患者等分布外数据集上,该方法的零样本和少样本评估表现优于当前最先进技术。
English: This study introduces a cross-modal self-supervised pretraining method using unlabeled IMU-video data to improve generalizability in human activity recognition, particularly for out-of-distribution datasets like Parkinson's disease patients, outperforming current state-of-the-art approaches in zero-shot and few-shot evaluations.
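A minimal sketch of cross-modal contrastive pretraining between IMU and video embeddings, using a symmetric InfoNCE (CLIP-style) objective, is shown below; the placeholder encoders and loss details are assumptions and will differ from the paper's architecture.

```python
# Sketch: embed synchronized IMU windows and video clips, pull matched pairs
# together with a symmetric InfoNCE loss (placeholder encoders).
import torch
import torch.nn as nn
import torch.nn.functional as F

imu_encoder = nn.Sequential(nn.Flatten(), nn.Linear(6 * 200, 256), nn.ReLU(),
                            nn.Linear(256, 128))          # (B, 6, 200) IMU windows
video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(16 * 512, 256), nn.ReLU(),
                              nn.Linear(256, 128))        # (B, 16, 512) frame features

def clip_style_loss(z_imu: torch.Tensor, z_vid: torch.Tensor, tau: float = 0.07):
    z_imu, z_vid = F.normalize(z_imu, dim=-1), F.normalize(z_vid, dim=-1)
    logits = z_imu @ z_vid.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(len(z_imu))                    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    imu = torch.randn(8, 6, 200)          # accel + gyro, 200 timesteps
    vid = torch.randn(8, 16, 512)         # 16 precomputed frame features
    loss = clip_style_loss(imu_encoder(imu), video_encoder(vid))
    loss.backward()                       # both encoders receive gradients
    print(float(loss))
```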

Authors:Chihiro Noguchi, Takaki Yamamoto
Title: From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction
Abstract:
Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction -- where each voxel is assigned a semantic label -- annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code is available at https://github.com/ToyotaInfoTech/b2s-occupancy
中文: 本研究提出了一种新框架,通过预训练和自动标注利用低成本二元占据数据来提升3D语义占据预测性能,在视觉为中心的自动驾驶系统中优于现有方法。
English: This study introduces a novel framework that leverages low-cost binary occupancy data through pre-training and auto-labeling to enhance 3D semantic occupancy prediction, outperforming existing methods in vision-centric autonomous driving systems.
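The binary-to-semantic decomposition can be sketched as a model with a binary-occupancy head, which can be pretrained on large unlabeled occupancy grids, and a semantic head whose class predictions are gated by the predicted occupancy; the layer sizes and gating rule below are illustrative assumptions, not the released framework.

```python
# Sketch: binary occupancy module (pretrainable on binary data) feeding a
# semantic module whose predictions only apply to predicted-occupied voxels.
import torch
import torch.nn as nn

class BinaryThenSemantic(nn.Module):
    def __init__(self, feat_dim: int = 32, num_classes: int = 17):
        super().__init__()
        self.backbone = nn.Conv3d(feat_dim, feat_dim, 3, padding=1)
        self.binary_head = nn.Conv3d(feat_dim, 1, 1)            # occupied vs. free
        self.semantic_head = nn.Conv3d(feat_dim, num_classes, 1)

    def forward(self, voxel_feats: torch.Tensor):
        h = torch.relu(self.backbone(voxel_feats))               # (B, C, X, Y, Z)
        occ_logit = self.binary_head(h)
        sem_logit = self.semantic_head(h)
        # Semantic probabilities only matter inside predicted-occupied voxels.
        sem_prob = torch.sigmoid(occ_logit) * sem_logit.softmax(dim=1)
        return occ_logit, sem_prob

if __name__ == "__main__":
    model = BinaryThenSemantic()
    occ, sem = model(torch.randn(1, 32, 16, 16, 8))
    print(occ.shape, sem.shape)   # the binary head alone can be pretrained on binary data
```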

Authors:Xiaojian Lin, Wenxin Zhang, Yuchu Jiang, Wangyu Wu, Yiran Guo, Kangxu Wang, Zongzheng Zhang, Guijin Wang, Lei Jin, Hao Zhao
Title: Butter: Frequency Consistency and Hierarchical Fusion for Autonomous Driving Object Detection
Abstract:
Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.
中文: Butter框架通过频率自适应特征一致性增强和渐进式分层特征融合网络,提升了自动驾驶目标检测中的分层特征表示能力,在多个基准测试中实现了更高的精度与效率。
English: The Butter framework enhances hierarchical feature representations in object detection for autonomous driving through its Frequency-Adaptive Feature Consistency Enhancement and Progressive Hierarchical Feature Fusion Network, achieving improved accuracy and efficiency across multiple benchmarks.

Authors:Atharv Goel, Mehar Khurana
Title: Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop
Abstract:
Modern 3D object detection datasets are constrained by narrow class taxonomies and costly manual annotations, limiting their ability to scale to open-world settings. In contrast, 2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection via natural language prompts. In this work, we leverage the maturity and category diversity of 2D foundation models to perform open-vocabulary 3D object detection without any human-annotated 3D labels. Our pipeline uses a 2D vision-language detector to generate text-conditioned proposals, which are segmented with SAM and back-projected into 3D using camera geometry and either LiDAR or monocular pseudo-depth. We introduce a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes without training. To simulate adverse real-world conditions, we construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset. Experiments demonstrate that our method achieves competitive localization performance across multiple settings, including LiDAR-based and purely RGB-D inputs, all while remaining training-free and open-vocabulary. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception. We open-source our code and resources at https://github.com/atharv0goel/open-world-3D-det.
中文摘要:本研究提出一种无需训练的开词汇3D物体检测方法,通过利用2D视觉语言模型生成提案并结合几何策略推断3D边界框,在无人标注3D标签的情况下实现了具有竞争力的检测性能。
English Summary: This work introduces a training-free method for open-vocabulary 3D object detection by leveraging 2D vision-language models to generate proposals and geometric strategies to infer 3D bounding boxes, achieving competitive performance without human-annotated 3D labels.
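The training-free geometric inflation step can be illustrated as follows: back-projected points are clustered with DBSCAN, a minimum-area rotated rectangle is fitted in the ground plane by testing each convex-hull edge direction (rotating calipers), and the z-extent gives the box height. Thresholds and the synthetic points below are illustrative, not the paper's exact parameters.

```python
# Sketch: DBSCAN clustering + rotating-calipers min-area box in bird's-eye view.
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import DBSCAN

def min_area_bev_box(points_xy: np.ndarray):
    hull = points_xy[ConvexHull(points_xy).vertices]
    best = None
    for i in range(len(hull)):
        edge = hull[(i + 1) % len(hull)] - hull[i]
        angle = np.arctan2(edge[1], edge[0])
        c, s = np.cos(-angle), np.sin(-angle)
        rot = points_xy @ np.array([[c, -s], [s, c]]).T      # align edge with x-axis
        mins, maxs = rot.min(0), rot.max(0)
        area = np.prod(maxs - mins)
        if best is None or area < best[0]:
            best = (area, angle, mins, maxs)
    return best   # (area, yaw, bev_min, bev_max)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(300, 3)) * [2.0, 0.8, 0.5] + [10, 3, 1]   # one "object"
    labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(pts)
    cluster = pts[labels == 0]
    area, yaw, lo, hi = min_area_bev_box(cluster[:, :2])
    print(f"yaw={yaw:.2f} rad, bev size={(hi - lo).round(2)}, "
          f"height={np.ptp(cluster[:, 2]):.2f}")
```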

Authors:Binbin Ji, Siddharth Agrawal, Qiance Tang, Yvonne Wu
Title: Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
Abstract:
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In particular, we find that SFT overfits to surface-level linguistic patterns and may degrade performance when test-time phrasing changes (e.g., from "closer to" to "farther from"). GRPO, on the other hand, generalizes more reliably and maintains stable performance under such shifts. Our findings provide insights into how reinforcement learning and structured prompting improve the spatial reasoning capabilities and generalization behavior of modern VLMs. All code is open source at: https://github.com/Yvonne511/spatial-vlm-investigator
中文摘要:本研究通过场景图结构化提示和GRPO强化学习方法,显著提升了视觉语言模型的空间推理准确性和泛化能力,优于简单思维链方法和监督微调。
English Summary: This research demonstrates that structured scene graph-based prompting and reinforcement learning with GRPO significantly enhance vision-language models' spatial reasoning accuracy and generalization, outperforming simple Chain-of-Thought methods and supervised fine-tuning.
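An illustrative prompt builder for the scene-graph-style multi-stage CoT described above is sketched below; the exact wording used in the paper may differ.

```python
# Illustrative two-stage SceneGraph-CoT prompt (wording is an assumption).
def scenegraph_cot_prompt(question: str) -> str:
    return (
        "Stage 1: List the objects in the image and, for each pair that matters, "
        "state their spatial relation (left/right, above/below, closer/farther) "
        "as triples (object_a, relation, object_b).\n"
        "Stage 2: Using only the scene graph from Stage 1, reason step by step and "
        f"answer the question.\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    print(scenegraph_cot_prompt(
        "Which object is closer to the camera, the mug or the laptop?"))
```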

Authors:Le-Anh Tran, Chung Nguyen Tran, Ngoc-Luu Nguyen, Nhan Cach Dang, Jordi Carrabina, David Castells-Rufas, Minh Son Nguyen
Title: Low-Light Enhancement via Encoder-Decoder Network with Illumination Guidance
Abstract:
This paper introduces a novel deep learning framework for low-light image enhancement, named the Encoder-Decoder Network with Illumination Guidance (EDNIG). Building upon the U-Net architecture, EDNIG integrates an illumination map, derived from Bright Channel Prior (BCP), as a guidance input. This illumination guidance helps the network focus on underexposed regions, effectively steering the enhancement process. To further improve the model's representational power, a Spatial Pyramid Pooling (SPP) module is incorporated to extract multi-scale contextual features, enabling better handling of diverse lighting conditions. Additionally, the Swish activation function is employed to ensure smoother gradient propagation during training. EDNIG is optimized within a Generative Adversarial Network (GAN) framework using a composite loss function that combines adversarial loss, pixel-wise mean squared error (MSE), and perceptual loss. Experimental results show that EDNIG achieves competitive performance compared to state-of-the-art methods in quantitative metrics and visual quality, while maintaining lower model complexity, demonstrating its suitability for real-world applications. The source code for this work is available at https://github.com/tranleanh/ednig.
中文摘要:本文提出EDNIG深度学习框架,通过光照引导和多尺度特征提取在GAN框架中增强低光照图像,以更低复杂度实现了业界领先的增强效果。
English Summary: This paper presents EDNIG, a deep learning framework that enhances low-light images using illumination guidance and multi-scale feature extraction within a GAN framework, achieving state-of-the-art performance with reduced complexity.
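The Bright Channel Prior guidance can be sketched in a few lines: the per-pixel maximum over colour channels, max-filtered over a local window, gives a rough illumination map that is fed to the network alongside the image; the window size is an illustrative choice.

```python
# Sketch: Bright Channel Prior illumination map as an extra guidance channel.
import numpy as np
from scipy.ndimage import maximum_filter

def bright_channel_prior(img: np.ndarray, window: int = 15) -> np.ndarray:
    """img: (H, W, 3) float image in [0, 1] -> (H, W) illumination map in [0, 1]."""
    bright = img.max(axis=2)                               # brightest channel per pixel
    return maximum_filter(bright, size=window)             # local max pooling

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_light = rng.random((64, 64, 3)) * 0.2              # synthetic underexposed image
    guidance = bright_channel_prior(low_light)
    # Dark regions (low guidance values) are where the enhancement network
    # should focus; the map is stacked with the input as an extra channel.
    net_input = np.concatenate([low_light, guidance[..., None]], axis=2)
    print(net_input.shape)   # (64, 64, 4)
```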

Authors:Yang Zhou, Junjie Li, CongYang Ou, Dawei Yan, Haokui Zhang, Xizhe Xue
Title: Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives
Abstract:
Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in Unmanned Aerial Vehicles (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text-image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We keep tracing related works at https://github.com/zhouyang2002/OVOD-in-UVA-imagery
中文摘要:本文系统综述了无人机航拍图像中的开放词汇目标检测技术,通过分析现有方法、数据集及核心挑战,为通过自然语言描述实现自主场景理解提供研究路线图。
English Summary: This paper surveys open-vocabulary object detection in UAV aerial imagery, analyzing methods, datasets, and challenges to enhance autonomous scene understanding through natural language descriptions.

Authors:Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao
Title: Hierarchical Rectified Flow Matching with Mini-Batch Couplings
Abstract:
Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution just like vanilla flow matching is capable of modeling a multi-modal data distribution. While this hierarchy enables modeling multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at https://riccizz.github.io/HRF_coupling.

Authors:Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Title: VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Abstract:
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
中文: VisionThink提出了一种动态视觉令牌压缩方法,通过自适应调整图像分辨率,在保证多数视觉问答任务性能的同时显著提升效率,并强化了OCR任务的细粒度识别能力。
English: VisionThink introduces a dynamic visual token compression method that adaptively processes images at different resolutions, enhancing efficiency while maintaining strong performance across most VQA tasks and improving fine-grained OCR capabilities.

Authors:Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Title: $π^3$: Permutation-Equivariant Visual Geometry Learning
Abstract:
We introduce $π^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $π^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.

Authors:Minghao Chen, Jianyuan Wang, Roman Shapovalov, Tom Monnier, Hyunyoung Jung, Dilin Wang, Rakesh Ranjan, Iro Laina, Andrea Vedaldi
Title: AutoPartGen: Autoregressive 3D Part Generation and Discovery
Abstract:
We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object's parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.
中文: AutoPartGen是一种自回归模型,能够根据图像、部件掩码或现有3D对象逐部件生成组合式三维重建,在部件级生成任务中达到最优性能。
English: AutoPartGen is an autoregressive model that generates 3D objects part by part from inputs like images, masks, or existing 3D shapes, achieving state-of-the-art performance in compositional reconstruction.

Authors:Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Title: VITA: Vision-to-Action Flow Matching Policy
Abstract:
We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
中文摘要:VITA是一种创新的视觉到动作策略框架,它通过流匹配直接将视觉表征映射到潜在动作,无需迭代去噪和条件机制,在实现卓越性能的同时显著提升了推理速度。
English Summary: VITA is a novel vision-to-action policy framework that eliminates iterative denoising and conditioning mechanisms by directly mapping visual representations to latent actions through flow matching, achieving superior performance with significantly faster inference speeds.
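As background for the paradigm described above, a generic rectified-flow / flow-matching training objective with a non-Gaussian source can be sketched as follows, with the flow source taken to be a latent visual feature and the target a latent action of the same shape. This is an illustrative sketch, not the released VITA code, and model(x_t, t) is a hypothetical velocity predictor.

import torch

def flow_matching_loss(model, z_vision, z_action):
    # z_vision: latent visual features, shape (B, D), used as the flow source
    # z_action: latent actions of the same shape, used as the flow target
    # model(x_t, t) is assumed to predict the velocity field (hypothetical interface)
    b = z_vision.shape[0]
    t = torch.rand(b, 1)                            # random time in [0, 1]
    x_t = (1.0 - t) * z_vision + t * z_action       # linear interpolation path
    v_target = z_action - z_vision                  # constant velocity of that path
    v_pred = model(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)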

Authors:Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox
Title: Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models
Abstract:
Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.

Authors:Alicia Durrer, Florentin Bieder, Paul Friedrich, Bjoern Menze, Philippe C. Cattin, Florian Kofler
Title: fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting
Abstract:
Healthy tissue inpainting has significant applications, including the generation of pseudo-healthy baselines for tumor growth models and the facilitation of image registration. In previous editions of the BraTS Local Synthesis of Healthy Brain Tissue via Inpainting Challenge, denoising diffusion probabilistic models (DDPMs) demonstrated qualitatively convincing results but suffered from low sampling speed. To mitigate this limitation, we adapted a 2D image generation approach, combining DDPMs with generative adversarial networks (GANs) and employing a variance-preserving noise schedule, for the task of 3D inpainting. Our experiments showed that the variance-preserving noise schedule and the selected reconstruction losses can be effectively utilized for high-quality 3D inpainting in a few time steps without requiring adversarial training. We applied our findings to a different architecture, a 3D wavelet diffusion model (WDM3D) that does not include a GAN component. The resulting model, denoted as fastWDM3D, obtained an SSIM of 0.8571, an MSE of 0.0079, and a PSNR of 22.26 on the BraTS inpainting test set. Remarkably, it achieved these scores using only two time steps, completing the 3D inpainting process in 1.81 s per image. When compared to other DDPMs used for healthy brain tissue inpainting, our model is up to 800x faster while still achieving superior performance metrics. Our proposed method, fastWDM3D, represents a promising approach for fast and accurate healthy tissue inpainting. Our code is available at https://github.com/AliciaDurrer/fastWDM3D.
Chinese: 本研究提出的fastWDM3D模型采用三维小波扩散方法,仅需两个时间步就能完成高质量健康脑组织修复,在保持优异性能指标的同时,比先前方法提速高达800倍。
English: The study introduces fastWDM3D, a 3D wavelet diffusion model that achieves high-quality healthy brain tissue inpainting with superior performance metrics while being up to 800 times faster than previous methods by completing the process in just two time steps.
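For reference, a variance-preserving (DDPM-style) forward noising step of the kind mentioned in the abstract can be written as below; this is a generic sketch under standard definitions, not the fastWDM3D implementation.

import numpy as np

def vp_forward(x0, t, betas):
    # Variance-preserving forward step: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

betas = np.linspace(1e-4, 0.02, 1000)   # a standard linear beta schedule
x0 = np.random.randn(8, 8, 8)           # toy 3D volume standing in for a wavelet band
x_t, eps = vp_forward(x0, t=500, betas=betas)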

Authors:Xiaohan Guo, Yusong Cai, Zejia Liu, Zhengning Wang, Lili Pan, Hongliang Li
Title: R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning
Abstract:
Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network's routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8% reduction in forgetting rates and 63.3% fewer parameters on the CustomConcept 101 dataset. Our code is available at https://github.com/learninginvision/R2MoE.
中文: R^2MoE框架通过路由蒸馏、冗余专家消除和分层局部注意力机制,以最小参数量实现终身视觉概念学习,在显著降低遗忘率和参数量的同时达到最优性能。
English: The proposed R^2MoE framework enables lifelong visual concept learning with minimal parameter overhead by combining routing distillation, redundant expert elimination, and hierarchical local attention, achieving state-of-the-art performance with significantly reduced forgetting rates and parameters.
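As background, the gating-plus-experts structure that routing distillation operates on is the standard mixture-of-experts layer; a minimal PyTorch sketch (illustrative only, not the R^2MoE code) is:

import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    # Plain mixture-of-experts layer: a gating network routes each token to its top-k experts.
    def __init__(self, dim, num_experts=4, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                           # x: (B, dim)
        weights, idx = torch.topk(torch.softmax(self.gate(x), dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out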

Authors:Han Zhang, Xiangde Luo, Yong Chen, Kang Li
Title: DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model
Abstract:
Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods producing single deterministic segmentation predictions often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective -- either generating a probabilistic "gold standard" consensus or preserving expert-specific preferences -- thus struggling to provide a more omni view. In this study, we propose DiffOSeg, a two-stage diffusion-based framework, which aims to simultaneously achieve both consensus-driven (combining all experts' opinions) and preference-driven (reflecting experts' individual assessments) segmentation. Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preference via adaptive prompts. Demonstrated on two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at https://github.com/string-ellipses/DiffOSeg.
中文摘要:DiffOSeg是一种基于扩散模型的双阶段框架,通过建立群体共识和捕捉专家特定偏好,同时实现共识驱动和偏好驱动的医学图像分割,在公开数据集上表现优于现有方法。
English Summary: DiffOSeg is a diffusion-based framework that simultaneously achieves both consensus-driven and preference-driven medical image segmentation by establishing population consensus and capturing expert-specific preferences, outperforming existing methods on public datasets.

Authors:Yi Xin, Le Zhuo, Qi Qin, Siqi Luo, Yuewen Cao, Bin Fu, Yangfan He, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Peng Gao
Title: Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation
Abstract:
AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves a FID score of 3.71, matching state-of-the-art AR models in the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our codes and models are available at https://github.com/synbol/MaskGIL.
中文摘要:本研究提出改进的掩码自回归模型MaskGIL,仅需8次推理步骤即可达到与自回归模型相当的图像生成质量,并扩展至文本到图像和语音到图像的实时转换应用。
English Summary: This study introduces MaskGIL, an enhanced Masked AutoRegressive model that achieves state-of-the-art image generation quality with only 8 inference steps while matching AR models' performance, and extends to text-to-image and speech-to-image applications.
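The few-step inference quoted above follows the usual masked parallel decoding recipe: at each of a handful of iterations the model predicts every masked token, the most confident predictions are kept, and the rest are re-masked on a cosine schedule. A schematic sketch, assuming a hypothetical model(tokens) that returns per-position logits (not the MaskGIL implementation):

import math
import torch

def masked_parallel_decode(model, seq_len, steps=8, mask_id=0):
    # Start from an all-[MASK] sequence and reveal tokens over a few iterations.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for s in range(steps):
        logits = model(tokens)                        # (1, seq_len, vocab), assumed interface
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        masked = tokens.eq(mask_id)
        if not masked.any():
            break
        # cosine schedule: fraction of positions left masked after this step
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (s + 1) / steps))
        n_new = max(int(masked.sum()) - keep_masked, 1)
        conf = conf.masked_fill(~masked, -1.0)        # never re-pick finished positions
        reveal = conf.topk(n_new, dim=-1).indices
        tokens[0, reveal[0]] = pred[0, reveal[0]]
    return tokens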

Authors:Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang
Title: Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
Abstract:
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment's overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.
中文: 针对当前视觉语言导航在物理部署中的理想化假设,VLN-PE平台首次系统评估了多类机器人导航方法,揭示了观测受限、光照变化和物理碰撞导致的性能下降问题,为提升跨载体适应性开辟了新途径。
English: Recent VLN advancements overlook physical deployment challenges, so VLN-PE introduces a realistic platform for humanoid, quadruped, and wheeled robots, revealing performance issues due to limited observations, lighting variations, and physical constraints while offering a pathway for improved adaptability.

Authors:Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Title: Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning
Abstract:
The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection, as an efficient alternative paradigm, offers an important direction for accelerating the training process. However, recent advances in sample selection either mostly rely on an oracle model to offline select a high-quality coreset, which is limited in cold-start scenarios, or focus on online selection based on real-time model predictions, which has not sufficiently or efficiently considered noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection strategy and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: https://github.com/MediaBrain-SJTU/DISSect.
中文: DISSect方法通过比较当前模型与历史模型的预测差异来精准识别噪声样本,在多个基准测试和下游任务中均展现出优于现有技术的训练加速效果。
English: The proposed DISSect method addresses noisy correspondence in contrastive learning by using the differential between current and historical model predictions to efficiently select high-quality samples, demonstrating superior performance across benchmarks and tasks.
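The selection signal itself is simple to state: score each image-text pair by how much its predicted correlation improves relative to a historical model snapshot and keep the pairs with the largest differential. A hedged numpy sketch of that idea (not the released implementation):

import numpy as np

def select_by_differential(sim_current, sim_historical, keep_ratio=0.5):
    # sim_*: predicted image-text correlations (e.g. cosine similarities) of the
    # matched pairs under the current model and an earlier checkpoint, shape (N,).
    # Pairs whose correlation grows the most are treated as clean and kept.
    differential = sim_current - sim_historical
    n_keep = int(len(differential) * keep_ratio)
    return np.argsort(-differential)[:n_keep]         # indices of the selected samples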

Authors:Qiang Wang, Mengchao Wang, Fan Jiang, Yaqi Fan, Yonggang Qi, Mu Xu
Title: FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers
Abstract:
Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model's ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is https://fantasy-amap.github.io/fantasy-portrait/.

Authors:Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo
Title: LoViC: Efficient Long Video Generation with Context Compression
Abstract:
Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
Chinese: LoViC是一种基于扩散变换器的框架,通过FlexFormer将视频和文本压缩为统一潜在表示,有效解决了自注意力二次复杂度带来的挑战,生成长且连贯的视频内容。
English: LoViC is a diffusion transformer framework that generates long, coherent videos by using FlexFormer to compress video and text into unified latent representations, overcoming the limitations of quadratic complexity in self-attention.

Authors:Yucheng Tang, Yunguan Fu, Weixi Yi, Yipei Wang, Daniel C. Alexander, Rhodri Davies, Yipeng Hu
Title: Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications
Abstract:
Multimodal large language models (MLLMs) can process and integrate information from multimodality sources, such as text and images. However, the interrelationship among input modalities, the uncertainties due to individual uni-modal data, and the potential clinical applications following such an uncertainty decomposition are not yet fully understood in the context of large-scale MLLMs. In this work, we propose a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation, to characterise the relationship among the uncertainties arising from image-only, text-only, and joint image-text variations in MLLM inputs. Using real clinical data consisting of cardiac MR scans and digital health records, we show that MUPMs can be optimised robustly with a few samples. We then show that the fitted MUPMs are generalisable across different input data distributions and, perhaps surprisingly, across different downstream tasks. Such a transferability may be explained by the shared pretraining, comparatively light MLLM fine-tuning, along with the low-dimensional nature of the MUPMs. More importantly, this learned transferability, quantifying the relationship between these uncertainties, led to direct clinical applications in which uncertainties may be estimated and thus analysed robustly for varying data or even a novel set of cardiac disease prediction tasks. In addition, we show experimentally the efficiency in multimodal data required for estimating the overall uncertainty and its ability to identify redundant factors, both of which are considered practical yet clinically useful applications with the proposed MUPMs. Codes are available at https://github.com/yucheng722/MUPM.
中文: 本研究提出了一种多模态不确定性传播模型(MUPM),能有效表征并量化多模态大语言模型中图像与文本输入的不确定性,在数据分布和临床任务间展现出强大泛化能力,并成功应用于心脏疾病预测等实际场景。
English: This study introduces a multimodal uncertainty propagation model (MUPM) that effectively characterizes and quantifies uncertainties from individual and combined image-text inputs in MLLMs, demonstrating robust generalizability across data distributions and clinical tasks with practical applications in cardiac disease prediction.

Authors:Caixia Dong, Duwei Dai, Xinyi Han, Fan Liu, Xu Yang, Zongfang Li, Songhua Xu
Title: Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion
Abstract:
Accurate coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by the activation of the final two ViT blocks and the integration of an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions by incorporating multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation and showcasing strong generalization across multiple datasets. The code is available at https://github.com/d1c2x3/CAseg.
中文摘要:本文提出了一种新颖的冠状动脉分割框架,通过并行编码架构融合视觉基础模型,并引入不确定性优化模块,在多个数据集上实现了卓越的分割精度和泛化能力。
English Summary: This paper introduces a novel coronary artery segmentation framework that combines vision foundation models with parallel encoding and uncertainty refinement, achieving superior accuracy and generalization across multiple datasets.

Authors:Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Title: DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
Abstract:
Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at https://github.com/LeeDongYeun/dmq.
中文: 本文提出DMQ方法,结合学习等效缩放和通道级二次幂缩放技术,通过自适应时间步加权有效处理扩散模型中的异常值和误差累积问题,在低比特量化下显著优于现有方法并保持图像生成质量。
English: This paper introduces DMQ, a novel quantization method combining Learned Equivalent Scaling and channel-wise Power-of-Two Scaling with adaptive timestep weighting to effectively handle outliers and error accumulation in diffusion models, achieving superior performance at low bit-widths while maintaining generation quality.
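Channel-wise power-of-two scaling restricts each channel's quantization scale to the nearest power of two so that rescaling reduces to bit shifts. A small illustrative example of that mechanism, paired with symmetric fake quantization (assumed details, not the DMQ code):

import numpy as np

def pot_scales(channel_absmax):
    # Round per-channel scales to the nearest power of two (bit-shift friendly).
    return 2.0 ** np.round(np.log2(channel_absmax + 1e-12))

def fake_quant(x, scale, n_bits=6):
    # Symmetric uniform fake quantization with a per-channel scale (last dim = channels).
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(x / scale * qmax), -qmax, qmax)
    return q * scale / qmax

acts = np.random.randn(16, 8) * np.array([0.1, 1.0, 3.0, 0.5, 2.0, 0.2, 8.0, 1.5])
scales = pot_scales(np.abs(acts).max(axis=0))          # one power-of-two scale per channel
acts_q = fake_quant(acts, scales, n_bits=6)            # W4A6-style activation quantization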

Authors:Tomohiro Suzuki, Ryota Tanaka, Calvin Yeung, Keisuke Fujii
Title: AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability
Abstract:
Monocular 3D pose estimation is a promising, flexible alternative to costly motion capture systems for sports analysis. However, its practical application is hindered by two factors: a lack of realistic sports datasets and unclear reliability for sports tasks. To address these challenges, we introduce the AthleticsPose dataset, a new public dataset featuring "real" motions captured from 23 athletes performing various athletics events on an athletic field. Using this dataset, we trained a representative 3D pose estimation model and performed a comprehensive evaluation. Our results show that the model trained on AthleticsPose significantly outperforms a baseline model trained on an imitated sports motion dataset, reducing MPJPE by approximately 75%. These results show the importance of training on authentic sports motion data, as models based on imitated motions do not effectively transfer to real-world motions. Further analysis reveals that estimation accuracy is sensitive to camera view and subject scale. In case studies of kinematic indicators, the model demonstrated the potential to capture individual differences in knee angles but struggled with higher-speed metrics, such as knee-drive velocity, due to prediction biases. This work provides the research community with a valuable dataset and clarifies the potential and practical limitations of using monocular 3D pose estimation for sports motion analysis. Our dataset, code, and checkpoints are available at https://github.com/SZucchini/AthleticsPose.
中文: 单目三维姿态估计为运动分析提供了一种经济灵活的替代方案,但受限于缺乏真实数据集和可靠性不明确的问题;AthleticsPose数据集通过提供真实运动数据显著提升了模型性能,同时揭示了该方法在捕捉运动特征方面的潜力与局限。
English: Monocular 3D pose estimation offers a cost-effective alternative for sports analysis but faces limitations due to the scarcity of realistic datasets and unclear reliability, which the AthleticsPose dataset addresses by providing authentic motion data that significantly improves model performance and highlights its potential and limitations in capturing sports movements.
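For context, the MPJPE metric reported above is simply the mean Euclidean distance between predicted and ground-truth joint positions:

import numpy as np

def mpjpe(pred, gt):
    # Mean Per-Joint Position Error, in the same units as the inputs (typically mm).
    # pred, gt: arrays of shape (frames, joints, 3).
    return np.linalg.norm(pred - gt, axis=-1).mean()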

Authors:Shiqi Huang, Shuting He, Huaiyuan Qin, Bihan Wen
Title: SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
Abstract:
Most existing remote sensing instance segmentation approaches are designed for close-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open-vocabulary REmote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, providing a robust solution for large-scale, real-world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.
中文摘要:提出的SCORE框架通过整合多粒度场景上下文来增强视觉和文本表示,解决了现有遥感实例分割方法的局限性,在面向多样化地球观测场景的开放词汇学习中实现了最先进的性能。
English Summary: The proposed SCORE framework addresses the limitations of existing remote sensing instance segmentation methods by integrating multi-granularity scene context to enhance both visual and textual representations, achieving state-of-the-art performance in open-vocabulary learning for diverse Earth observation scenarios.

Authors:Ziyi Wang, Zhi Gao, Jin Chen, Qingjie Zhao, Xinxiao Wu, Jiebo Luo
Title: Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization
Abstract:
Domain generalization (DG) aims to learn a model from source domains and apply it to unseen target domains with out-of-distribution data. Owing to CLIP's strong ability to encode semantic concepts, it has attracted increasing interest in domain generalization. However, CLIP often struggles to focus on task-relevant regions across domains, i.e., domain-invariant regions, resulting in suboptimal performance on unseen target domains. To address this challenge, we propose an attention-refocusing scheme, called Simulate, Refocus and Ensemble (SRE), which learns to reduce the domain shift by aligning the attention maps in CLIP via attention refocusing. SRE first simulates domain shifts by performing augmentation on the source data to generate simulated target domains. SRE then learns to reduce the domain shifts by refocusing the attention in CLIP between the source and simulated target domains. Finally, SRE utilizes ensemble learning to enhance the ability to capture domain-invariant attention maps between the source data and the simulated target data. Extensive experimental results on several datasets demonstrate that SRE generally achieves better results than state-of-the-art methods. The code is available at: https://github.com/bitPrincy/SRE-DG.
Chinese: 针对CLIP在领域泛化中难以聚焦跨领域不变区域的问题,本文提出的SRE方法通过模拟领域偏移和注意力重聚焦机制,结合集成学习显著提升了模型在未知目标领域的性能表现。
English: To address CLIP's limitation in focusing on domain-invariant regions for domain generalization, the proposed SRE method employs attention refocusing through domain shift simulation and ensemble learning, demonstrating superior performance across multiple datasets.

Authors:Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, Soo-Hyung Kim
Title: ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion
Abstract:
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: https://github.com/sonvth/ATL-Diff
中文: 本文提出ATL-Diff这一创新方法,通过关键点引导噪声分布和3D身份扩散网络,在提升音频驱动说话头部生成的同步性、降低噪声和计算成本的同时,有效保留面部特征,实现了卓越性能与近实时处理。
English: This paper presents ATL-Diff, an innovative audio-driven talking head generation method that enhances synchronization, reduces noise and computational costs, and preserves facial identity through landmark-guided noise distribution and a 3D identity diffusion network, achieving superior performance and near real-time processing.

Authors:Junjie Gao, Runze Liu, Yingzhe Peng, Shujian Yang, Jin Zhang, Kai Yang, Zhiyuan You
Title: DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment
Abstract:
Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. Also, we relax the resolution constraints to support the large resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Codes and model weights are available at https://github.com/Junjie-Gao19/DeQA-Doc.
中文摘要:本文提出了DeQA-Doc框架,通过采用多模态大语言模型的视觉语言能力和软标签策略来评估文档质量,在多种退化类型上显著超越了现有基线方法。
English Summary: This paper introduces DeQA-Doc, a framework that adapts a state-of-the-art multimodal large language model for document quality assessment using soft labels and ensemble methods, significantly outperforming existing baselines across various degradation types.
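Soft-label score regression in this line of work typically means the model places a probability distribution over a few discrete quality levels and the continuous score is their expectation; the exact construction without variance information is paper-specific. A generic sketch under that assumption:

import numpy as np

levels = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # hypothetical "bad" ... "excellent" levels
probs = np.array([0.05, 0.10, 0.20, 0.40, 0.25])      # model's softmax over the level tokens

score = float((levels * probs).sum())                  # probability-weighted quality score: 3.7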

Authors:Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, Jun Zhu
Title: AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation
Abstract:
Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm -- such as low coverage density, behavioral redundancy, and safety risks -- we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $ 30\times $ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos
Chinese Summary: 本研究提出了ATARA框架,通过自监督方式将任务无关动作数据收集效率提升30倍以上,并开发AnyPos逆动力学模型优化从该数据中的学习效果,在多类操作任务中实现了性能的显著提升。
English Summary: This research introduces ATARA, a scalable self-supervised framework that accelerates task-agnostic action data collection by over 30 times, and AnyPos, an inverse dynamics model that enhances learning efficiency from such data, achieving significant improvements in manipulation task performance.

Authors:Peijun Wang, Jinhua Zhao
Title: SOD-YOLO: Enhancing YOLO-Based Detection of Small Objects in UAV Imagery
Abstract:
Small object detection remains a challenging problem in the field of object detection. To address this challenge, we propose an enhanced YOLOv8-based model, SOD-YOLO. This model integrates an ASF mechanism in the neck to enhance multi-scale feature fusion, adds a Small Object Detection Layer (named P2) to provide higher-resolution feature maps for better small object detection, and employs Soft-NMS to refine confidence scores and retain true positives. Experimental results demonstrate that SOD-YOLO significantly improves detection performance, achieving a 36.1% increase in mAP$_{50:95}$ and 20.6% increase in mAP$_{50}$ on the VisDrone2019-DET dataset compared to the baseline model. These enhancements make SOD-YOLO a practical and efficient solution for small object detection in UAV imagery. Our source code, hyper-parameters, and model weights are available at https://github.com/iamwangxiaobai/SOD-YOLO.
Chinese: 提出的SOD-YOLO模型通过融合多尺度特征、增加高分辨率检测层和使用Soft-NMS,显著提升了小目标检测性能,在VisDrone2019-DET数据集上mAP$_{50:95}$和mAP$_{50}$分别提高了36.1%和20.6%。
English: The proposed SOD-YOLO model enhances small object detection by integrating multi-scale feature fusion, adding a high-resolution detection layer, and employing Soft-NMS, achieving significant performance improvements of 36.1% in mAP$_{50:95}$ and 20.6% in mAP$_{50}$ on the VisDrone2019-DET dataset.
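Soft-NMS, used above to refine confidence scores, decays the scores of overlapping detections instead of discarding them outright. A compact reference implementation of the Gaussian variant (illustrative, not the repository code):

import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    # Gaussian Soft-NMS: decay the scores of boxes that overlap the current best box.
    scores = scores.copy()
    keep, idx = [], np.arange(len(scores))
    while len(idx) > 0:
        i = idx[np.argmax(scores[idx])]
        keep.append(int(i))
        idx = idx[idx != i]
        if len(idx) == 0:
            break
        scores[idx] *= np.exp(-iou(boxes[i], boxes[idx]) ** 2 / sigma)
        idx = idx[scores[idx] > score_thr]            # drop boxes whose score decayed away
    return keep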

Authors:Yang Yang, Dongni Mao, Hiroaki Santo, Yasuyuki Matsushita, Fumio Okura
Title: NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement
Abstract:
We develop a neural parametric model for 3D leaves for plant modeling and reconstruction, which are essential for agriculture and computer graphics. While neural parametric models are actively studied for humans and animals, plant leaves present unique challenges due to their diverse shapes and flexible deformation. To address this problem, we introduce a neural parametric model for leaves, NeuraLeaf. Capitalizing on the fact that flattened leaf shapes can be approximated as a 2D plane, NeuraLeaf disentangles the leaves' geometry into their 2D base shapes and 3D deformations. This representation allows learning from rich sources of 2D leaf image datasets for the base shapes, and also has the advantage of simultaneously learning textures aligned with the geometry. To model the 3D deformation, we propose a novel skeleton-free skinning model and create a newly captured 3D leaf dataset called DeformLeaf. We show that NeuraLeaf successfully generates a wide range of leaf shapes with deformation, resulting in accurate model fitting to 3D observations like depth maps and point clouds. Our implementation and dataset are available at https://neuraleaf-yang.github.io/.

Authors:Zahra TehraniNasab, Hujun Ni, Amar Kumar, Tal Arbel
Title: Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images
Abstract:
Medical image synthesis presents unique challenges due to the inherent complexity and high-resolution details required in clinical contexts. Traditional generative architectures such as Generative Adversarial Networks (GANs) or Variational Auto Encoder (VAEs) have shown great promise for high-resolution image generation but struggle with preserving fine-grained details that are key for accurate diagnosis. To address this issue, we introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize images at resolutions of 1024x1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high resolution medical image generation, enabling the preservation of both global anatomical context and local image-level details. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels. We apply our model to the CheXpert dataset and demonstrate its ability to generate clinically faithful chest X-rays from text prompts. Beyond visual quality, these high-resolution synthetic images prove valuable for downstream tasks such as classification, showing measurable performance gains when used for data augmentation, particularly in low-data regimes. Our code is accessible through the project website - https://tehraninasab.github.io/pixelperfect-megamed.

Authors:Rajesh Sureddi, Saman Zadtootaghaj, Nabajeet Barman, Alan C. Bovik
Title: TRIQA: Image Quality Assessment by Contrastive Pretraining on Ordered Distortion Triplets
Abstract:
Image Quality Assessment (IQA) models aim to predict perceptual image quality in alignment with human judgments. No-Reference (NR) IQA remains particularly challenging due to the absence of a reference image. While deep learning has significantly advanced this field, a major hurdle in developing NR-IQA models is the limited availability of subjectively labeled data. Most existing deep learning-based NR-IQA approaches rely on pre-training on large-scale datasets before fine-tuning for IQA tasks. To further advance progress in this area, we propose a novel approach that constructs a custom dataset using a limited number of reference content images and introduces a no-reference IQA model that incorporates both content and quality features for perceptual quality prediction. Specifically, we train a quality-aware model using contrastive triplet-based learning, enabling efficient training with fewer samples while achieving strong generalization performance across publicly available datasets. Our repository is available at https://github.com/rajeshsureddi/triqa.
Chinese: 本研究提出了一种新的无参考图像质量评估模型,通过对比三元组学习和定制数据集,利用有限样本有效预测感知质量,并在公开数据集上展现出强大的泛化性能。
English: This study introduces a novel no-reference image quality assessment (NR-IQA) model that leverages contrastive triplet-based learning and a custom dataset to effectively predict perceptual quality with limited training samples, demonstrating strong generalization across public datasets.
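The contrastive pretraining on ordered distortion triplets rests on the standard triplet margin objective: pull the anchor toward the sample whose distortion level is closer and push it away from the farther one. A generic PyTorch sketch:

import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive/negative: embeddings of shape (B, D); the positive is the image
    # whose distortion level is closer to the anchor's in the ordered triplet.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()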

Authors:Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Title: FORTRESS: Function-composition Optimized Real-Time Resilient Structural Segmentation via Kolmogorov-Arnold Enhanced Spatial Attention Networks
Abstract:
Automated structural defect segmentation in civil infrastructure faces a critical challenge: achieving high accuracy while maintaining computational efficiency for real-time deployment. This paper presents FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation), a new architecture that balances accuracy and speed by combining depthwise separable convolutions with adaptive Kolmogorov-Arnold Network integration. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework achieving a 3.6x parameter reduction per layer, adaptive TiKAN integration that selectively applies function composition transformations only when computationally beneficial, and multi-scale attention fusion combining spatial, channel, and KAN-enhanced features across decoder levels. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement while delivering superior segmentation performance. Evaluation on benchmark infrastructure datasets demonstrates state-of-the-art results with an F1-score of 0.771 and a mean IoU of 0.677, significantly outperforming existing methods including U-Net, SA-UNet, and U-KAN. The dual optimization strategy proves essential for optimal performance, establishing FORTRESS as a robust solution for practical structural defect segmentation in resource-constrained environments where both accuracy and computational efficiency are paramount. Comprehensive architectural specifications are provided in the Supplemental Material. Source code is available at https://github.com/faeyelab/fortress-paper-code.
中文: 本文提出FORTRESS架构,通过深度可分离卷积与自适应KAN集成,在保持高精度的同时大幅降低参数与计算量,实现了实时结构缺陷分割的最优性能。
English: This paper introduces FORTRESS, an efficient architecture for structural defect segmentation that significantly reduces parameters and computational complexity while achieving superior accuracy and real-time performance.
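The per-layer saving from depthwise separable convolutions follows directly from the factorization: a k x k convolution mapping C_in to C_out channels uses k^2 * C_in * C_out weights, whereas the depthwise-plus-pointwise pair uses k^2 * C_in + C_in * C_out. A quick check of that arithmetic (the paper's 3.6x per-layer figure depends on its particular kernel sizes and channel widths):

def conv_params(c_in, c_out, k):
    return k * k * c_in * c_out                 # standard convolution weights

def dws_conv_params(c_in, c_out, k):
    return k * k * c_in + c_in * c_out          # depthwise conv + 1x1 pointwise conv

c_in, c_out, k = 64, 64, 3
ratio = conv_params(c_in, c_out, k) / dws_conv_params(c_in, c_out, k)
print(round(ratio, 1))                          # ~7.9 for this 64->64, 3x3 example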

Authors:Said Ohamouddou, Abdellatif El Afia, Hanaa El Afia, Raddouane Chiheb
Title: MS-DGCNN++: A Multi-Scale Fusion Dynamic Graph Neural Network with Biological Knowledge Integration for LiDAR Tree Species Classification
Abstract:
Tree species classification from terrestrial LiDAR point clouds is challenging because of the complex multi-scale geometric structures in forest environments. Existing approaches using multi-scale dynamic graph convolutional neural networks (MS-DGCNN) employ parallel multi-scale processing, which fails to capture the semantic relationships between the hierarchical levels of the tree architecture. We present MS-DGCNN++, a hierarchical multiscale fusion dynamic graph convolutional network that uses semantically meaningful feature extraction at local, branch, and canopy scales with cross-scale information propagation. Our method employs scale-specific feature engineering, including standard geometric features for the local scale, normalized relative vectors for the branch scale, and distance information for the canopy scale. This hierarchical approach replaces uniform parallel processing with semantically differentiated representations that are aligned with the natural tree structure. Under the same proposed tree species data augmentation strategy for all experiments, MS-DGCNN++ achieved an accuracy of 94.96% on STPCTLS, outperforming DGCNN, MS-DGCNN, and the state-of-the-art model PPT. On FOR-species20K, it achieves 67.25% accuracy (6.1% improvement compared to MS-DGCNN). For standard 3D object recognition, our method outperformed DGCNN and MS-DGCNN with overall accuracies of 93.15% on ModelNet40 and 94.05% on ModelNet10. With fewer parameters and reduced complexity compared to state-of-the-art transformer approaches, our method is suitable for resource-constrained applications while maintaining a competitive accuracy. Beyond tree classification, the method generalizes to standard 3D object recognition, establishing it as a versatile solution for diverse point cloud processing applications. The implementation code is publicly available at https://github.com/said-ohamouddou/MS-DGCNN2.
Chinese: MS-DGCNN++ 提出了一种层次化多尺度融合网络,通过捕捉局部、枝干和冠层尺度的语义关系,提升了从LiDAR数据中进行树种分类的准确性,并以更低的复杂度实现了卓越性能。
English: MS-DGCNN++ introduces a hierarchical multiscale fusion network that enhances tree species classification from LiDAR data by capturing semantic relationships across local, branch, and canopy scales, achieving superior accuracy with reduced complexity.

Authors:Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
Title: Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
Abstract:
This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
中文: 本文提出Mono-InternVL这一单体式多模态大语言模型,通过创新的视觉参数空间和内生视觉预训练方法解决优化不稳定与灾难性遗忘问题,在多个基准测试中表现优异并显著降低延迟。
English: This paper introduces Mono-InternVL, a monolithic Multimodal Large Language Model that addresses optimization instability and catastrophic forgetting through a novel visual parameter space and Endogenous Visual Pre-training, achieving superior performance on multiple benchmarks while reducing latency.

Authors:Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Title: MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Abstract:
Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) struggle frequently with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM with this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves over an average 8% performance boost on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
中文摘要:MindJourney通过将视觉语言模型与视频扩散世界模型相结合,无需微调即可显著提升三维空间推理能力,在基准测试中平均性能提高超过8%。
English Summary: MindJourney enhances vision-language models' 3D spatial reasoning by integrating them with video diffusion world models, achieving over 8% performance improvement on benchmarks without requiring fine-tuning.

Authors:Richard Marcus, Marc Stamminger
Title: Physically Based Neural LiDAR Resimulation
Abstract:
Methods for Novel View Synthesis (NVS) have recently found traction in the field of LiDAR simulation and large-scale 3D scene reconstruction. While solutions for faster rendering or handling dynamic scenes have been proposed, LiDAR specific effects remain insufficiently addressed. By explicitly modeling sensor characteristics such as rolling shutter, laser power variations, and intensity falloff, our method achieves more accurate LiDAR simulation compared to existing techniques. We demonstrate the effectiveness of our approach through quantitative and qualitative comparisons with state-of-the-art methods, as well as ablation studies that highlight the importance of each sensor model component. Beyond that, we show that our approach exhibits advanced resimulation capabilities, such as generating high resolution LiDAR scans in the camera perspective. Our code and the resulting dataset are available at https://github.com/richardmarcus/PBNLiDAR.
Chinese: 我们的方法通过模拟滚动快门和激光功率等传感器特性,提高了LiDAR仿真的准确性,在定量比较和生成相机视角高分辨率扫描等高级应用中均优于现有技术。
English: Our method enhances LiDAR simulation accuracy by modeling sensor characteristics like rolling shutter and laser power, outperforming existing techniques in both quantitative comparisons and advanced applications such as generating high-resolution scans from camera perspectives.
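Intensity falloff in LiDAR is commonly modeled with an inverse-square range term and a Lambertian incidence-angle term; the sketch below illustrates such a generic physically based model and is an assumption for illustration, not the paper's exact sensor model.

import numpy as np

def lidar_intensity(reflectance, range_m, cos_incidence, emitted_power=1.0):
    # Received intensity ~ emitted power * reflectance * cos(incidence angle) / range^2.
    return (emitted_power * reflectance
            * np.clip(cos_incidence, 0.0, 1.0)
            / np.maximum(range_m, 1e-6) ** 2)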

Authors:Muhammed Furkan Dasdelen, Hyesu Lim, Michele Buck, Katharina S. Götze, Carsten Marr, Steffen Schneider
Title: CytoSAE: Interpretable Cell Embeddings for Hematology
Abstract:
Sparse autoencoders (SAEs) emerged as a promising tool for mechanistic interpretability of transformer-based foundation models. Very recently, SAEs were also adopted for the visual domain, enabling the discovery of visual concepts and their patch-wise attribution to tokens in the transformer model. While a growing number of foundation models emerged for medical imaging, tools for explaining their inferences are still lacking. In this work, we show the applicability of SAEs for hematology. We propose CytoSAE, a sparse autoencoder which is trained on over 40,000 peripheral blood single-cell images. CytoSAE generalizes to diverse and out-of-domain datasets, including bone marrow cytology, where it identifies morphologically relevant concepts which we validated with medical experts. Furthermore, we demonstrate scenarios in which CytoSAE can generate patient-specific and disease-specific concepts, enabling the detection of pathognomonic cells and localized cellular abnormalities at the patch level. We quantified the effect of concepts on a patient-level AML subtype classification task and show that CytoSAE concepts reach performance comparable to the state-of-the-art, while offering explainability on the sub-cellular level. Source code and model weights are available at https://github.com/dynamical-inference/cytosae.
中文: 本研究提出CytoSAE稀疏自编码器,基于4万余张血细胞图像训练,可识别并验证血液学中具有临床意义的视觉概念,实现可解释的疾病分类与细胞异常检测。
English: This study introduces CytoSAE, a sparse autoencoder trained on over 40,000 blood cell images, which identifies and validates clinically relevant visual concepts for hematology, enabling explainable disease classification and cellular abnormality detection.
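A sparse autoencoder in this setting is a single hidden layer trained to reconstruct frozen transformer features under a sparsity penalty, so that each hidden unit tends to fire for one visual concept. A minimal PyTorch sketch (illustrative, not the CytoSAE code):

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):                       # x: frozen ViT patch features, (B, d_model)
        z = torch.relu(self.encoder(x))         # sparse, non-negative concept activations
        return self.decoder(z), z

def sae_loss(model, x, l1_coeff=1e-3):
    x_hat, z = model(x)
    return ((x_hat - x) ** 2).mean() + l1_coeff * z.abs().mean()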

Authors:Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, Xiaowei Zhou
Title: SpatialTrackerV2: 3D Point Tracking Made Easy
Abstract:
We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feedforward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.
中文:SpatialTrackerV2是一种前馈式三维点跟踪方法,通过端到端架构统一场景几何、相机运动和物体位移,性能比现有方法提升30%,在保持顶尖动态重建精度的同时处理速度加快50倍。
English: SpatialTrackerV2 is a unified, feed-forward 3D point tracking method that integrates scene geometry, camera motion, and object movement into an end-to-end architecture, achieving 30% higher performance than existing methods and matching top dynamic reconstruction accuracy with 50x faster processing.

Authors:Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian
Title: Mitigating Object Hallucinations via Sentence-Level Early Intervention
Abstract:
Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
中文摘要:SENTINEL框架通过在文本生成的早期阶段进行句子级干预,利用自举构建的领域内偏好数据训练模型,有效减少了多模态大语言模型中的幻觉现象,在无需人工标注的情况下将幻觉降低超过90%,并在各项基准测试中保持卓越性能。
English Summary: The SENTINEL framework effectively reduces hallucinations in multimodal language models by detecting and correcting fabricated content at the sentence level during early text generation, achieving over 90% reduction without human annotations while maintaining superior performance on benchmark tests.
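Since the abstract centers on a sentence-level preference loss, a minimal sketch may help. This assumes pre-computed token log-probabilities and omits the context-aware weighting of the paper's C-DPO; `beta` and the function name are ours.

```python
import torch.nn.functional as F

def sentence_dpo_loss(logp_pos, logp_pos_ref, logp_neg, logp_neg_ref, beta=0.1):
    """DPO-style objective restricted to the sentence where a hallucination
    first manifests. Each input is the summed token log-probability of that
    sentence under the policy (logp_*) or a frozen reference model (logp_*_ref)."""
    margin = (logp_pos - logp_pos_ref) - (logp_neg - logp_neg_ref)
    return -F.logsigmoid(beta * margin).mean()
```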

Authors:Yen-Linh Vu, Dinh-Thang Duong, Truong-Binh Duong, Anh-Khoi Nguyen, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Jianhua Xing, Xingjian Li, Tianyang Wang, Ulas Bagci, Min Xu
Title: Describe Anything Model for Visual Question Answering on Text-rich Images
Abstract:
Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.
中文摘要:DAM-QA框架利用描述任意模型(DAM)的区域感知能力,通过聚合图像多个区域的答案来增强视觉问答,尤其在文本密集图像上表现卓越,以更少参数在多项基准测试中取得领先性能。
English Summary: The DAM-QA framework leverages the region-aware capabilities of the Describe Anything Model to enhance Visual Question Answering, particularly for text-rich images, by aggregating answers from multiple regional views and achieving superior performance on benchmarks with fewer parameters.
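A rough sketch of the multi-view aggregation idea, under our own simplifications: overlapping grid crops plus the full image are each answered independently, and a plain majority vote picks the output. `vqa_model` is a hypothetical stand-in for a DAM-based answerer; the paper's protocol may weight or filter votes differently.

```python
from collections import Counter

def grid_regions(width, height, n=3, overlap=0.2):
    """Overlapping n x n grid of (left, top, right, bottom) boxes."""
    step_x, step_y = width / n, height / n
    pad_x, pad_y = step_x * overlap, step_y * overlap
    return [(max(0.0, i * step_x - pad_x), max(0.0, j * step_y - pad_y),
             min(width, (i + 1) * step_x + pad_x),
             min(height, (j + 1) * step_y + pad_y))
            for i in range(n) for j in range(n)]

def answer_by_vote(image, question, vqa_model):
    views = [None] + grid_regions(image.width, image.height)  # None = full image
    answers = [vqa_model(image, question, region=box) for box in views]
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]
```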

Authors:Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
Title: EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Abstract:
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos lies not only in their scale but, more importantly, in the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA on the Ego Humanoid Manipulation Benchmark, show significant improvements over baselines, and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
Chinese: 本文提出EgoVLA模型,通过人类视频训练视觉-语言-动作模型预测动作,再经逆运动学转换为机器人动作,并利用少量机器人演示进行微调,显著提升了操作性能。
English: This paper introduces EgoVLA, a Vision-Language-Action model trained on human videos to predict actions, which are then converted to robot actions through inverse kinematics and fine-tuned with minimal robot demonstrations for improved manipulation performance.

Authors:Jaehyun Kwak, Ramahdani Muhammad Izaaz Inhar, Se-Young Yun, Sung-Ju Lee
Title: QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employing contrastive learning, which treats the target image as positive and all other images in the batch as negatives, can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.
Chinese: 提出的通过硬负样本采样的查询相关检索方法QuRe,通过优化奖励模型和采用硬负样本采样策略,解决了组合图像检索中的假阴性问题,在新旧数据集上均实现了最优性能,并与人类偏好更佳对齐。
English: The proposed Query-Relevant Retrieval through Hard Negative Sampling (QuRe) method addresses the issue of false negatives in composed image retrieval by optimizing a reward model and employing a hard negative sampling strategy, achieving state-of-the-art performance and better alignment with human preferences on new and existing datasets.
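The "between two steep drops" selection rule is concrete enough to sketch. This is our reading of the abstract (the paper's exact drop criterion may differ), and it assumes at least two gallery items follow the target in the ranking.

```python
import numpy as np

def hard_negatives(scores, target_idx):
    """scores: relevance score per gallery image, higher = more relevant.
    Returns indices of candidates lying between the two steepest score
    drops after the target, i.e. likely true negatives."""
    order = np.argsort(-scores)                   # ranking, best first
    t = int(np.where(order == target_idx)[0][0])  # rank of the target
    tail = scores[order][t:]                      # scores from the target onward
    drops = -np.diff(tail)                        # successive score drops
    d1, d2 = np.sort(np.argsort(-drops)[:2])      # two steepest drops, in rank order
    return order[t + d1 + 1 : t + d2 + 1]
```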

Authors:Kaiwen Huang, Yi Zhou, Huazhu Fu, Yizhe Zhang, Chen Gong, Tao Zhou
Title: Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation
Abstract:
Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. When labeled data is limited, textual information can provide additional context to enhance visual semantic understanding. However, research exploring the use of textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. In this paper, we propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg), which consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Specifically, TMR facilitates text-visual interaction through planar mapping, thereby enhancing the category awareness of visual features. CSA performs cross-modal semantic alignment between the text features with introduced learnable variables and the intermediate layer of visual features. DCA reduces the distribution discrepancy between labeled and unlabeled data through their interaction, thus improving the model's robustness. Finally, experiments on three public datasets demonstrate that our model effectively enhances visual features with textual information and outperforms other methods. Our code is available at https://github.com/taozh2017/Text-SemiSeg.
中文摘要:本文提出Text-SemiSeg框架,通过文本驱动的多平面视觉交互和语义对齐技术提升半监督医学图像分割效果,在三个公开数据集上验证了其优越性。
English Summary: This paper introduces Text-SemiSeg, a novel text-driven framework that enhances 3D medical image segmentation through multiplanar visual-text interaction and semantic alignment, demonstrating superior performance over existing methods.

Authors:M. Anwar Ma'sum, Mahardhika Pratama, Savitha Ramasamy, Lin Liu, Habibullah Habibullah, Ryszard Kowalczyk
Title: PROL : Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning
Abstract:
The data privacy constraint in online continual learning (OCL), where the data can be seen only once, complicates the catastrophic forgetting problem in streaming data. A common approach among current SOTA methods in OCL is to save exemplars or features from previous classes in memory and replay them in the current task. On the other hand, prompt-based approaches perform excellently in continual learning, but at the cost of a growing number of trainable parameters. The first approach may not be applicable in practice due to data openness policies, while the second has throughput issues when handling streaming data. In this study, we propose a novel prompt-based method for online continual learning that includes 4 main components: (1) a single lightweight prompt generator as general knowledge, (2) a trainable scaler-and-shifter as specific knowledge, (3) preservation of pre-trained model (PTM) generalization, and (4) a hard-soft update mechanism. Our proposed method achieves significantly higher performance than the current SOTAs on the CIFAR100, ImageNet-R, ImageNet-A, and CUB datasets. Our complexity analysis shows that our method requires a relatively smaller number of parameters and achieves moderate training time, inference time, and throughput. For further study, the source code of our method is available at https://github.com/anwarmaxsum/PROL.
Chinese: 本研究提出了一种新颖的在线持续学习提示方法,结合轻量级提示生成器、可训练的缩放移位器、预训练模型保持和硬软更新机制,以较少参数和适中计算成本实现了卓越性能。
English: This study introduces a novel prompt-based method for online continual learning that integrates a lightweight prompt generator, trainable scaler-shifter, pre-trained model preservation, and a hard-soft update mechanism, achieving superior performance with fewer parameters and moderate computational demands.
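Of the four components, the scaler-and-shifter is the simplest to illustrate. A minimal sketch, assuming it is a per-dimension affine modulation of frozen pre-trained features (the paper's exact placement and shape may differ):

```python
import torch
import torch.nn as nn

class ScalerShifter(nn.Module):
    """Lightweight task-specific knowledge: an elementwise affine map
    applied on top of frozen pre-trained model (PTM) features."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, feats):  # feats: (B, dim)
        return feats * self.scale + self.shift
```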

Authors:Felix Nützel, Mischa Dombrowski, Bernhard Kainz
Title: Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
Abstract:
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
中文: 基于生成式文本到图像扩散模型,结合跨注意力机制和领域特定语言模型微调,在医学影像短语定位任务中实现了卓越的零样本性能,其mIoU分数较现有判别方法提升一倍,并通过新型双模态偏置融合技术进一步提升了定位精度。
English: Generative text-to-image diffusion models, enhanced with cross-attention maps and fine-tuned using domain-specific language models, achieve superior zero-shot phrase grounding performance in medical imaging, doubling mIoU scores over current discriminative methods and introducing Bimodal Bias Merging for further accuracy improvements.
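The grounding signal itself is easy to picture: average the cross-attention that image locations pay to the phrase's text tokens. A hedged sketch under our own simplifications (uniform averaging over layers and timesteps; the paper's BBM refinement is not reproduced):

```python
import torch

def phrase_heatmap(attn_maps, phrase_token_ids):
    """attn_maps: list of (heads, H*W, n_text_tokens) cross-attention
    tensors collected across layers/timesteps; returns a normalized
    (H*W,) localization heatmap for the phrase."""
    maps = [a[..., phrase_token_ids].mean(dim=(0, -1)) for a in attn_maps]
    heat = torch.stack(maps).mean(0)
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```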

Authors:Shuangli Du, Siming Yan, Zhenghao Shi, Zhenzhen You, Lu Sun
Title: Wavelet-based Decoupling Framework for low-light Stereo Image Enhancement
Abstract:
Low-light images suffer from complex degradation, and existing enhancement methods often encode all degradation factors within a single latent space. This leads to highly entangled features and strong black-box characteristics, making the model prone to shortcut learning. To mitigate the above issues, this paper proposes a wavelet-based low-light stereo image enhancement method with feature space decoupling. Our insight comes from the following findings: (1) Wavelet transform enables the independent processing of low-frequency and high-frequency information. (2) Illumination adjustment can be achieved by adjusting the low-frequency component of a low-light image, extracted through multi-level wavelet decomposition. Thus, by using the wavelet transform, the feature space is decomposed into a low-frequency branch for illumination adjustment and multiple high-frequency branches for texture enhancement. Additionally, stereo low-light image enhancement can extract useful cues from the other view to improve enhancement. To this end, we propose a novel high-frequency guided cross-view interaction module (HF-CIM) that operates within high-frequency branches rather than across the entire feature space, effectively extracting valuable image details from the other view. Furthermore, to enhance the high-frequency information, a detail and texture enhancement module (DTEM) is proposed based on a cross-attention mechanism. The model is trained on a dataset consisting of images with uniform illumination and images with non-uniform illumination. Experimental results on both real and synthetic images indicate that our algorithm offers significant advantages in light adjustment while effectively recovering high-frequency information. The code and dataset are publicly available at: https://github.com/Cherisherr/WDCI-Net.git.
中文摘要:本文提出了一种基于小波变换的低光立体图像增强方法,通过将特征空间解耦为低频和高频分支,并利用跨视角交互与纹理增强模块,有效改善了光照调整与细节恢复效果。
English Summary: This paper introduces a wavelet-based method that decouples feature space into low-frequency and high-frequency branches for stereo low-light image enhancement, utilizing cross-view interaction and texture enhancement modules to improve illumination adjustment and detail recovery.
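The low/high-frequency split at the heart of the method can be reproduced in a few lines. A minimal sketch with a fixed gain in place of the learned illumination branch (wavelet choice and level count are ours):

```python
import numpy as np
import pywt

def brighten_low_freq(img_gray, gain=1.8, wavelet="haar", levels=3):
    """Adjust illumination via the coarsest approximation band of a
    multi-level 2D wavelet decomposition, leaving the high-frequency
    detail bands untouched."""
    coeffs = pywt.wavedec2(img_gray.astype(np.float32), wavelet, level=levels)
    coeffs[0] = coeffs[0] * gain      # coeffs[0] is the low-frequency band
    return np.clip(pywt.waverec2(coeffs, wavelet), 0, 255)
```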

Authors:Sergey Linok, Gleb Naumov
Title: Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
Abstract:
We propose OVIGo-3DHSG: Open-Vocabulary Indoor Grounding of objects using a 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment as a Hierarchical Scene Graph derived from sequences of RGB-D frames, utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall, OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.
中文: OVIGo-3DHSG方法通过三维分层场景图与大语言模型相结合,实现了开放词汇的室内物体定位与空间推理,在场景理解能力上显著优于现有方法。
English: The OVIGo-3DHSG method utilizes a 3D hierarchical scene graph and large language models to enable open-vocabulary object grounding with enhanced spatial reasoning in indoor environments, demonstrating superior scene comprehension compared to existing approaches.

Authors:Yiquan Gao, Duohui Xu
Title: Out-of-distribution data supervision towards biomedical semantic segmentation
Abstract:
Biomedical segmentation networks easily suffer from unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out-of-Distribution (OoD) data on other visual tasks, we propose a data-centric framework, Med-OoD, to address this issue by introducing OoD data supervision into fully-supervised biomedical segmentation with none of the following needs: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification of the architectures. Extensive experiments show that Med-OoD largely prevents various segmentation networks from pixel misclassification on medical images and achieves considerable performance improvements on the Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network entirely with OoD data devoid of foreground class labels, which surprisingly yields 76.1% mIoU on the test set. We hope this learning paradigm will prompt the community to rethink the role of OoD data. Code is made available at https://github.com/StudioYG/Med-OoD.
中文:Med-OoD框架利用分布外数据提升生物医学分割的准确性,无需外部数据或网络结构调整即可显著减少误分类,并在医学数据集上实现性能提升。
English: The Med-OoD framework leverages Out-of-Distribution data to enhance biomedical segmentation accuracy without requiring external data or network modifications, significantly reducing misclassification and improving performance on medical datasets.

Authors:Nataliia Molchanova, Alessandro Cagol, Mario Ocampo-Pineda, Po-Jui Lu, Matthias Weigel, Xinjie Chen, Erin Beck, Charidimos Tsagkas, Daniel Reich, Colin Vanden Bulcke, Anna Stolting, Serena Borrelli, Pietro Maggi, Adrien Depeursinge, Cristina Granziera, Henning Mueller, Pedro M. Gordaliza, Meritxell Bach Cuadra
Title: Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis
Abstract:
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to subtle magnetic resonance imaging (MRI) appearance, challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to improved CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with F1-scores of 0.64 and 0.5 in and out of the domain, respectively. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, offering recommendations to address these barriers to clinical adoption. To reinforce reproducibility, the implementation and models will be publicly accessible and ready to use at https://github.com/Medical-Image-Analysis-Laboratory/ and https://doi.org/10.5281/zenodo.15911797.
中文: 本研究提出一个基于MRI数据的皮质病变检测与分割基准,采用改进的nnU-Net模型展现出稳定性能,并公开工具以促进临床应用。
English: This study introduces a benchmark for detecting and segmenting cortical lesions in multiple sclerosis using MRI data, employing an adapted nnU-Net model that demonstrates robust performance and provides public access to tools to enhance clinical adoption.

Authors:Xiang Yu, Xinyao Liu, Guang Liang
Title: YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association
Abstract:
Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 "Finding Birds" Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named SliceTrain. This framework, through the synergy of 'deterministic full-coverage slicing' and 'slice-level stochastic augmentation', effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a motion direction maintenance (EMA) mechanism and an adaptive similarity metric combining bounding box expansion and distance penalty into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities. Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of 55.205, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at https://github.com/Salvatore-Love/YOLOv8-SMOT.
中文摘要:本文提出了一种针对无人机视角下小型敏捷物体跟踪的冠军解决方案,通过系统性训练增强框架提升检测能力,并设计了不依赖外观特征的鲁棒跟踪器来应对复杂运动模式。
English Summary: This paper presents a championship-winning solution for tracking small, agile objects from UAVs by enhancing detection through a systematic training framework and developing a robust tracker that handles motion complexities without relying on appearance features.
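'Deterministic full-coverage slicing' suggests SAHI-style tiling where the last tile is snapped to the image border. A sketch under that assumption (the stochastic slice-level augmentation is omitted, and images smaller than one slice are not handled):

```python
def full_coverage_slices(width, height, slice_size=640, overlap=0.25):
    """Overlapping tiles that are guaranteed to cover the whole image."""
    stride = int(slice_size * (1 - overlap))
    xs = list(range(0, max(width - slice_size, 0) + 1, stride))
    ys = list(range(0, max(height - slice_size, 0) + 1, stride))
    if xs[-1] + slice_size < width:   # snap a final tile to the right border
        xs.append(width - slice_size)
    if ys[-1] + slice_size < height:  # ... and to the bottom border
        ys.append(height - slice_size)
    return [(x, y, x + slice_size, y + slice_size) for y in ys for x in xs]
```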

Authors:Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, Shouhong Ding
Title: MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
Abstract:
Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe an inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a clear margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
中文: 本文提出的MS-DETR框架通过统一学习视频中的运动与语义特征,结合对比去噪和数据增强策略,在时刻检索与高光检测任务上实现了超越现有最优模型的性能表现。
English: The paper introduces MS-DETR, a framework that leverages unified learning of motion and semantics features to enhance Video Moment Retrieval and Highlight Detection, achieving superior performance on benchmarks through contrastive denoising and data enrichment strategies.

Authors:Kun-Hsiang Lin, Yu-Wen Tseng, Kang-Yang Huang, Jhih-Ciang Wu, Wen-Huang Cheng
Title: InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
Abstract:
Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.
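中文: InstructFLIP是一种指令微调的人脸反欺诈框架,通过将指令解耦为内容与风格两部分并结合元域策略,仅在单一域上训练即可实现跨域泛化,在精度上超越现有最优模型并显著降低训练冗余。
English: InstructFLIP is an instruction-tuned face anti-spoofing framework that decouples instructions into content and style components and employs a meta-domain strategy, achieving strong cross-domain generalization from single-domain training while outperforming SOTA models and substantially reducing training redundancy.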

Authors:Beining Xu, Siting Zhu, Hesheng Wang
Title: SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation
Abstract:
We propose SGLoc, a novel localization system that directly regresses camera poses from 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method utilizes the semantic relationship between 2D image and 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multi-level pose regression strategy that progressively estimates and refines the pose of query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between 2D (image) and 3D (3DGS map). By matching the extracted scene semantic descriptors of 2D query image and 3DGS semantic representation, we align the image with the local region of the global 3DGS map, thereby obtaining a coarse pose estimation. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the rendered image from 3DGS. Our SGLoc demonstrates superior performance over baselines on 12scenes and 7scenes datasets, showing excellent capabilities in global localization without initial pose prior. Code will be available at https://github.com/IRMVLab/SGLoc.
中文: SGLoc是一种新颖的定位系统,通过利用语义信息直接从3D高斯泼溅表示中回归相机位姿,在无需初始位姿先验的情况下,在基准数据集上展现出卓越性能。
English: SGLoc is a novel localization system that directly regresses camera poses from 3D Gaussian Splatting representation using semantic information, achieving superior performance on benchmark datasets without requiring initial pose priors.

Authors:Yuechen Xie, Jie Song, Yicheng Shan, Xiaoyan Zhang, Yuanyu Wan, Shengxuming Zhang, Jiarui Duan, Mingli Song
Title: Dataset Ownership Verification for Pre-trained Masked Models
Abstract:
High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches. Code is available at https://github.com/xieyc99/DOV4MM.
Chinese: 本研究提出的DOV4MM方法通过利用在目标数据集上预训练的模型中对掩码信息重建难度的显著差异,解决了掩码模型数据集所有权验证这一关键挑战,其性能显著优于现有方法。
English: The proposed DOV4MM method addresses the critical challenge of verifying dataset ownership for masked models by exploiting the distinct reconstruction difficulty of masked information in models pre-trained on the target dataset, demonstrating superior performance over existing techniques.
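The verification logic reduces to a one-sided hypothesis test on reconstruction difficulty. A hedged sketch of that final step (how difficulty is measured in embedding space is the paper's contribution and is not reproduced here):

```python
from scipy import stats

def verify_ownership(err_suspect, err_control, alpha=0.05):
    """err_suspect / err_control: per-sample masked-reconstruction
    difficulties on the suspect dataset and an independent control set.
    If the model saw the suspect data during pre-training, its errors
    there should be systematically lower."""
    _, p = stats.ttest_ind(err_suspect, err_control, alternative="less")
    return p < alpha, p
```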

Authors:Linwei Chen, Lin Gu, Ying Fu
Title: Frequency-Dynamic Attention Modulation for Dense Prediction
Abstract:
Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
中文: 提出的频率动态注意力调制(FDAM)通过互补高通滤波和动态频率缩放解决视觉Transformer中的频率消失问题,在多种视觉任务中实现性能提升且避免表示崩溃。
English: The proposed Frequency-Dynamic Attention Modulation (FDAM) enhances Vision Transformers by addressing frequency vanishing through complementary high-pass filtering and dynamic frequency scaling, resulting in improved performance across multiple vision tasks without representation collapse.
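Attention Inversion is simple enough to state in code: the row-stochastic attention matrix acts as a low-pass filter, so its identity complement acts as the matching high-pass filter. A minimal sketch with a scalar mixing weight standing in for the paper's learned, dynamic combination:

```python
import torch

def attinv(attn, v, lam=0.5):
    """attn: (B, N, N) row-stochastic attention; v: (B, N, C) values."""
    eye = torch.eye(attn.size(-1), device=attn.device).expand_as(attn)
    low = attn @ v            # standard attention = low-pass filtering
    high = (eye - attn) @ v   # complementary high-pass response
    return low + lam * high
```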

Authors:Hao Li, Ju Dai, Feng Zhou, Kaida Ning, Lei Li, Junjun Pan
Title: AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation
Abstract:
While 3D facial animation has made impressive progress, challenges still exist in realizing fine-grained stylized 3D facial expression manipulation due to the lack of appropriate datasets. In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. AUBlendSet is a blendshape data collection based on 32 standard facial action units (AUs) across 500 identities, along with an additional set of facial postures annotated with detailed AUs. Based on AUBlendSet, we propose AUBlendNet to learn AU-Blendshape basis vectors for different character styles. AUBlendNet predicts, in parallel, the AU-Blendshape basis vectors of the corresponding style for a given identity mesh, thereby achieving stylized 3D emotional facial manipulation. We comprehensively validate the effectiveness of AUBlendSet and AUBlendNet through tasks such as stylized facial expression manipulation, speech-driven emotional facial animation, and emotion recognition data augmentation. Through a series of qualitative and quantitative experiments, we demonstrate the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. To the best of our knowledge, AUBlendSet is the first dataset, and AUBlendNet is the first network for continuous 3D facial expression manipulation for any identity through facial AUs. Our source code is available at https://github.com/wslh852/AUBlendNet.git.
中文: 本文提出了首个基于动作单元混合形状的3D人脸数据集AUBlendSet,并开发了AUBlendNet网络,通过学习不同风格的动作单元基向量,实现了跨身份的风格化3D面部表情操控。
English: This paper introduces AUBlendSet, the first 3D facial dataset using AU-Blendshape representation, and AUBlendNet, a novel network that enables stylized 3D facial expression manipulation across identities by learning action unit basis vectors.
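Applying an AU-Blendshape basis is a linear operation, which is what makes the representation convenient for manipulation. A sketch with shapes of our choosing (the 32 AUs follow the dataset description):

```python
import numpy as np

def apply_au_blendshapes(neutral, basis, au_weights):
    """neutral: (V, 3) vertices of an identity's neutral mesh;
    basis: (32, V, 3) per-AU offset vectors predicted for that identity;
    au_weights: (32,) activation per facial action unit."""
    return neutral + np.tensordot(au_weights, basis, axes=1)
```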

Authors:Jiahao Xia, Yike Wu, Wenjian Huang, Jianguo Zhang, Jian Zhang
Title: Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
Abstract:
Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most of them cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at the project https://github.com/Jiahao-UTS/MPAE.
Chinese: 掩码部件自编码器(MPAE)提出了一种无监督方法,通过学习部件描述符并重构掩码图像块,能在各种类别和场景中稳健地发现有意义部件,克服了以往方法的局限性。
English: The Masked Part Autoencoder (MPAE) introduces an unsupervised method that robustly discovers meaningful parts across diverse categories and scenarios by learning part descriptors and reconstructing masked image patches, overcoming limitations of previous approaches.

Authors:Jiajian Xie, Shengyu Zhang, Zhou Zhao, Fan Wu, Fei Wu
Title: EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models
Abstract:
Diffusion Models have shown remarkable proficiency in image and video synthesis. As increases in model size and latency limit user experience, a hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation, where the cloud model initiates high-quality semantic planning and the edge model expedites later-stage refinement. However, excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge model output. To address these challenges, we propose EC-Diff, which accelerates cloud inference through gradient-based noise estimation while identifying the optimal point for cloud-edge handoff to maintain generation quality. Specifically, we design a K-step noise approximation strategy to reduce cloud inference frequency by using noise gradients between steps and applying cloud inference periodically to adjust errors. Then we design a two-stage greedy search algorithm to efficiently find the optimal parameters for noise approximation and edge model switching. Extensive experiments demonstrate that our method significantly enhances generation quality compared to edge inference, while achieving up to an average $2\times$ speedup in inference compared to cloud inference. Video samples and source code are available at https://ec-diff.github.io/.
中文摘要:EC-Diff框架通过基于梯度的噪声估计和周期性云端推理加速扩散模型处理,同时采用优化切换策略保持生成质量,实现云端推理速度平均提升2倍且超越边缘模型生成效果。
English Summary: The proposed EC-Diff framework optimizes cloud-edge collaboration for diffusion models by using gradient-based noise estimation and periodic cloud inference to accelerate processing while maintaining output quality through an optimized handoff strategy.
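Our reading of the K-step noise approximation, sketched with stand-in callables (`cloud_eps` for the cloud network, `step_fn` for one sampler update); the cloud-edge handoff scheduling is omitted:

```python
def denoise_with_approximation(x, timesteps, cloud_eps, step_fn, K=4):
    """Call the cloud model every K steps; in between, extrapolate its
    noise prediction linearly using the last observed inter-step gradient."""
    prev_eps, grad = None, None
    for i, t in enumerate(timesteps):
        if i % K == 0 or grad is None:       # periodic (or warm-up) cloud call
            eps = cloud_eps(x, t)
            if prev_eps is not None:
                grad = eps - prev_eps        # refresh the noise gradient
        else:
            eps = prev_eps + grad            # cheap gradient-based estimate
        prev_eps = eps
        x = step_fn(x, eps, t)               # one denoising update
    return x
```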

Authors:Jianzhe Ma, Wenxuan Wang, Qin Jin
Title: A Survey of Deep Learning for Geometry Problem Solving
Abstract:
Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
中文: 本文综述了深度学习在几何解题中的应用,涵盖任务、方法、评估指标及未来挑战,旨在推动该领域发展。
English: This paper surveys deep learning applications in geometry problem solving, covering tasks, methods, evaluation metrics, and future challenges to advance the field.

Authors:Wei Sun, Linhan Cao, Kang Fu, Dandan Zhu, Jun Jia, Menghan Hu, Xiongkuo Min, Guangtao Zhai
Title: CompressedVQA-HDR: Generalized Full-reference and No-reference Quality Assessment Models for Compressed High Dynamic Range Videos
Abstract:
Video compression is a standard procedure applied to all videos to minimize storage and transmission demands while preserving visual quality as much as possible. Therefore, evaluating the visual quality of compressed videos is crucial for guiding the practical usage and further development of video compression algorithms. Although numerous compressed video quality assessment (VQA) methods have been proposed, they often lack the generalization capability needed to handle the increasing diversity of video types, particularly high dynamic range (HDR) content. In this paper, we introduce CompressedVQA-HDR, an effective VQA framework designed to address the challenges of HDR video quality assessment. Specifically, we adopt the Swin Transformer and SigLip 2 as the backbone networks for the proposed full-reference (FR) and no-reference (NR) VQA models, respectively. For the FR model, we compute deep structural and textural similarities between reference and distorted frames using intermediate-layer features extracted from the Swin Transformer as its quality-aware feature representation. For the NR model, we extract the global mean of the final-layer feature maps from SigLip 2 as its quality-aware representation. To mitigate the issue of limited HDR training data, we pre-train the FR model on a large-scale standard dynamic range (SDR) VQA dataset and fine-tune it on the HDRSDR-VQA dataset. For the NR model, we employ an iterative mixed-dataset training strategy across multiple compressed VQA datasets, followed by fine-tuning on the HDRSDR-VQA dataset. Experimental results show that our models achieve state-of-the-art performance compared to existing FR and NR VQA models. Moreover, CompressedVQA-HDR-FR won first place in the FR track of the Generalizable HDR & SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025. The code is available at https://github.com/sunwei925/CompressedVQA-HDR.
中文: 本文提出CompressedVQA-HDR框架,分别采用Swin Transformer和SigLip 2架构构建全参考和无参考视频质量评估模型,通过创新的训练策略在压缩HDR视频质量评估中实现了最优性能。
English: This paper introduces CompressedVQA-HDR, a novel video quality assessment framework that employs Swin Transformer and SigLip 2 architectures for full-reference and no-reference models respectively, achieving state-of-the-art performance in evaluating compressed HDR videos through innovative training strategies.

Authors:Linwei Chen, Ying Fu, Lin Gu, Dezhi Zheng, Jifeng Dai
Title: Spatial Frequency Modulation for Semantic Segmentation
Abstract:
High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei-Chen/SFM.
Chinese Summary: 提出的空间频率调制(SFM)方法通过在下采样前调制高频特征并在上采样时解调,有效保留语义分割中的细节信息,并在多种计算机视觉任务中验证了其广泛适用性。
English Summary: The proposed Spatial Frequency Modulation (SFM) method effectively preserves high-frequency details in semantic segmentation by modulating features before downsampling and demodulating them during upsampling, with demonstrated success across multiple computer vision tasks.

Authors:Peiwen Xia, Tangfei Liao, Wei Zhu, Danhuai Zhao, Jianjun Ke, Kaihao Zhang, Tong Lu, Tao Wang
Title: CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning
Abstract:
Establishing reliable correspondences between image pairs is a fundamental task in computer vision, underpinning applications such as 3D reconstruction and visual localization. Although recent methods have made progress in pruning outliers from dense correspondence sets, they often hypothesize consistent visual domains and overlook the challenges posed by diverse scene structures. In this paper, we propose CorrMoE, a novel correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. To address domain shift, we introduce a De-stylization Dual Branch, performing style mixing on both implicit and explicit graph features to mitigate the adverse influence of domain-specific representations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through linear-complexity attention and dynamic expert routing. Extensive experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods. The code and pre-trained models are available at https://github.com/peiwenxia/CorrMoE.
中文摘要:CorrMoE提出了一种新颖的对应关系筛选框架,通过去风格化双分支和双融合专家混合模块解决跨域跨场景挑战,在基准测试中展现出卓越的准确性和泛化能力。
English Summary: CorrMoE is a robust correspondence pruning framework that addresses cross-domain and cross-scene challenges through a de-stylization dual branch and bi-fusion mixture of experts module, achieving superior performance on benchmark datasets.

Authors:Benjamin Keel, Aaron Quyn, David Jayne, Maryam Mohsin, Samuel D. Relton
Title: Interpretable Prediction of Lymph Node Metastasis in Rectal Cancer MRI Using Variational Autoencoders
Abstract:
Effective treatment for rectal cancer relies on accurate lymph node metastasis (LNM) staging. However, radiological criteria based on lymph node (LN) size, shape and texture morphology have limited diagnostic accuracy. In this work, we investigate applying a Variational Autoencoder (VAE) as a feature encoder model to replace the large pre-trained Convolutional Neural Network (CNN) used in existing approaches. The motivation for using a VAE is that the generative model aims to reconstruct the images, so it directly encodes visual features and meaningful patterns across the data. This leads to a disentangled and structured latent space which can be more interpretable than a CNN. Models are deployed on an in-house MRI dataset with 168 patients who did not undergo neo-adjuvant treatment. The post-operative pathological N stage was used as the ground truth to evaluate model predictions. Our proposed model 'VAE-MLP' achieved state-of-the-art performance on the MRI dataset, with cross-validated metrics of AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05. Code is available at: https://github.com/benkeel/Lymph_Node_Classification_MIUA.
中文: 本研究提出了一种VAE-MLP模型,利用变分自编码器进行特征编码以改进直肠癌淋巴结转移分期,在MRI数据集上取得了最佳性能,AUC达到0.86。
English: This study introduces a VAE-MLP model that uses a variational autoencoder for feature encoding to improve lymph node metastasis staging in rectal cancer, achieving state-of-the-art performance with an AUC of 0.86 on an MRI dataset.
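The 'VAE-MLP' name suggests a small pipeline: encode the lymph node image with the VAE and classify the latent mean. A hedged sketch (layer sizes are ours, not the paper's configuration):

```python
import torch.nn as nn

class VAEMLP(nn.Module):
    def __init__(self, vae_encoder, latent_dim=128, hidden=64):
        super().__init__()
        self.encoder = vae_encoder  # pre-trained; returns (mu, logvar)
        self.head = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))   # binary LNM logit

    def forward(self, x):
        mu, _ = self.encoder(x)     # classify the latent mean, not a sample
        return self.head(mu)
```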

Authors:Hanxue Gu, Yaqian Chen, Nicholas Konz, Qihang Li, Maciej A. Mazurowski
Title: Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration?
Abstract:
Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the presence of thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images in MedSAM and SSLSAM does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code at https://github.com/mazurowski-lab/Foundation-based-reg.
中文: 基础模型在零样本乳腺MRI配准中展现出潜力,能在大域偏移下超越传统方法的整体对齐效果,但难以精确捕捉纤维腺体组织的细微结构,且特定领域的训练并未持续提升性能。
English: Foundation models show promise for zero-shot breast MRI registration by outperforming traditional methods in overall alignment under large domain shifts, but they struggle with accurately capturing fine fibroglandular tissue details, and domain-specific training does not consistently enhance performance.

Authors:Zejian Li, Yize Li, Chenye Meng, Zhongni Liu, Yang Ling, Shengyuan Zhang, Guang Yang, Changyuan Yang, Zhiyuan Yang, Lingyun Sun
Title: Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
Abstract:
Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method performs the intractable posterior sampling in Diffusion-DPO via deterministic inversion from winning and losing samples to noise, and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both the precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity, compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO.
中文: Inversion-DPO提出了一种新颖的对齐框架,通过将DDIM反演与直接偏好优化相结合,绕过了奖励建模,在文本到图像和组合图像生成等任务中显著提升了扩散模型的训练精度和效率。
English: Inversion-DPO introduces a novel alignment framework that bypasses reward modeling by integrating DDIM inversion with Direct Preference Optimization, enhancing training precision and efficiency for diffusion models in tasks like text-to-image and compositional image generation.
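The DDIM inversion ingredient is standard and worth spelling out. A sketch of the generic procedure (scheduler and network are stand-ins; how Inversion-DPO uses the inverted trajectories inside the DPO objective is not shown):

```python
import torch

@torch.no_grad()
def ddim_invert(x, timesteps, eps_model, alphas_cumprod):
    """Deterministically map a clean latent x back toward noise.
    timesteps: increasing, e.g. [0, 20, 40, ...];
    alphas_cumprod: 1-D tensor of cumulative alpha products."""
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean sample
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x  # approximate noise that regenerates x under DDIM
```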

Authors:Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
Title: Streaming 4D Visual Geometry Transformer
Abstract:
Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operator (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
中文: 本文提出了一种流式4D视觉几何变换器,通过因果注意力和知识蒸馏技术,实现了从视频中进行实时高质量的4D重建,在保持竞争力的同时显著提升了交互应用的推理速度。
English: This paper introduces a streaming 4D visual geometry transformer that enables real-time, high-quality 4D reconstruction from videos by using causal attention and knowledge distillation, achieving competitive performance with faster inference for interactive applications.
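The streaming mechanism mirrors KV caching in autoregressive LLMs: tokens of the newest frame attend to cached keys and values of all past frames. A single-head sketch (shapes and class name are ours):

```python
import torch
import torch.nn.functional as F

class CachedCausalAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.k_cache, self.v_cache = None, None

    def forward(self, frame_tokens):
        """frame_tokens: (B, N, C) tokens of the newest frame only."""
        q, k, v = self.qkv(frame_tokens).chunk(3, dim=-1)
        if self.k_cache is not None:              # implicit memory of the past
            k = torch.cat([self.k_cache, k], dim=1)
            v = torch.cat([self.v_cache, v], dim=1)
        self.k_cache, self.v_cache = k.detach(), v.detach()
        # causality holds because queries come only from the newest frame
        return F.scaled_dot_product_attention(q, k, v)
```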

Authors:Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, Yunchao Wei
Title: CharaConsist: Fine-Grained Consistent Character Generation
Abstract:
In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT models. Its ability to maintain fine-grained consistency, combined with the larger capacity of the latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at https://github.com/Murray-Wang/CharaConsist.
中文摘要:CharaConsist方法通过采用点跟踪注意力和自适应令牌合并技术,解决了文本生成图像中主体一致性问题,实现了前景和背景的精细一致性控制,适用于固定场景连续镜头和不同场景离散镜头的角色生成。
English Summary: The proposed CharaConsist method addresses subject consistency issues in text-to-image generation by implementing point-tracking attention and adaptive token merging, enabling fine-grained foreground and background consistency across continuous or discrete scenes.

Authors:Christian Daniele, Silvia Villa, Samuel Vaiter, Luca Calatroni
Title: Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent
Abstract:
Deep Equilibrium Models (DEQs) are implicit neural networks with fixed points, which have recently gained attention for learning image regularization functionals, particularly in settings involving Gaussian fidelities, where assumptions on the forward operator ensure contractiveness of standard (proximal) Gradient Descent operators. In this work, we extend the application of DEQs to Poisson inverse problems, where the data fidelity term is more appropriately modeled by the Kullback-Leibler divergence. To this end, we introduce a novel DEQ formulation based on Mirror Descent, defined in terms of a tailored non-Euclidean geometry that naturally adapts to the structure of the data term. This enables the learning of neural regularizers within a principled training framework. We derive sufficient conditions to guarantee the convergence of the learned reconstruction scheme and propose computational strategies that enable both efficient training and fully parameter-free inference. Numerical experiments show that our method outperforms traditional model-based approaches and is comparable in performance to Bregman Plug-and-Play methods, while mitigating their typical drawbacks, namely sensitivity to initialization and the need for careful hyperparameter tuning. The code is publicly available at https://github.com/christiandaniele/DEQ-MD.
English: This study extends Deep Equilibrium Models (DEQs) to Poisson inverse problems by introducing a novel Mirror Descent-based formulation that leverages a tailored non-Euclidean geometry, enabling efficient learning of neural regularizers with guaranteed convergence and outperforming traditional approaches while mitigating sensitivity to initialization and hyperparameters.
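To make the Mirror Descent step concrete, here is a minimal NumPy sketch of the underlying iteration for a Poisson/KL data fidelity, assuming the Shannon-entropy mirror map; the paper's tailored geometry and learned regularizer may differ, and `A`, `y`, and `prox_reg` are illustrative placeholders. A DEQ would iterate this operator to a fixed point.

```python
import numpy as np

def kl_gradient(x, A, y, eps=1e-12):
    """Gradient of the Poisson/KL data fidelity f(x) = sum(Ax - y * log(Ax))."""
    Ax = A @ x
    return A.T @ (1.0 - y / (Ax + eps))

def mirror_descent_step(x, A, y, step, prox_reg=lambda z: z):
    """One mirror descent step under the Shannon entropy mirror map
    phi(x) = sum(x log x): grad phi(x) = 1 + log x and grad phi*(u) = exp(u - 1),
    so iterates remain strictly positive by construction."""
    u = 1.0 + np.log(x) - step * kl_gradient(x, A, y)  # update in the dual space
    x_new = np.exp(u - 1.0)                            # map back to the primal
    return prox_reg(x_new)  # a learned neural regularizer would act here

# Toy usage on a random positive system.
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(32, 64))
y = rng.poisson(A @ rng.uniform(0.5, 2.0, size=64)).astype(float)
x = np.ones(64)
for _ in range(200):
    x = mirror_descent_step(x, A, y, step=0.05)
```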

Authors:Kaif Shaikh, Franziska Boenisch, Adam Dziedzic
Title: Implementing Adaptations for Vision AutoRegressive Model
Abstract:
Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in the image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While many such techniques exist for DMs, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations, which aim to preserve the privacy of the adaptation data, have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations; however, performance under DP suffers, which necessitates further research into private adaptations for VAR. Code is available at https://github.com/sprintml/finetuning_var_dp.
English: The VAR model surpasses DMs in non-private adaptations but requires further research for effective differentially private implementations, as current DP strategies underperform.
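For context on what a DP adaptation involves, below is a minimal PyTorch sketch of the DP-SGD recipe (per-sample gradient clipping plus Gaussian noise) that such methods build on; the naive per-sample loop and all hyperparameters are illustrative assumptions, not the strategies benchmarked in the paper.

```python
import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=1e-3, clip=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each per-sample gradient to norm <= clip,
    sum them, add Gaussian noise with std noise_mult * clip, then average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xb, yb):                      # naive per-sample loop
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip / (norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                     # accumulate clipped gradients
    with torch.no_grad():
        for p, s in zip(params, summed):
            noisy = (s + noise_mult * clip * torch.randn_like(s)) / len(xb)
            p.add_(noisy, alpha=-lr)              # SGD update with the noisy gradient
```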

Authors:Hongbo Ye, Fenghe Tang, Peiang Zhao, Zhen Huang, Dexin Zhao, Minghao Bian, S. Kevin Zhou
Title: U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV
Abstract:
Achieving equity in healthcare accessibility requires lightweight yet high-performance solutions for medical image segmentation, particularly in resource-limited settings. Existing methods like U-Net and its variants often suffer from limited global Effective Receptive Fields (ERFs), hindering their ability to capture long-range dependencies. To address this, we propose U-RWKV, a novel framework leveraging the Recurrent Weighted Key-Value (RWKV) architecture, which achieves efficient long-range modeling at O(N) computational cost. The framework introduces two key innovations: the Direction-Adaptive RWKV Module (DARM) and the Stage-Adaptive Squeeze-and-Excitation Module (SASE). DARM employs Dual-RWKV and QuadScan mechanisms to aggregate contextual cues across images, mitigating directional bias while preserving global context and maintaining high computational efficiency. SASE dynamically adapts its architecture to different feature extraction stages, balancing high-resolution detail preservation and semantic relationship capture. Experiments demonstrate that U-RWKV achieves state-of-the-art segmentation performance with high computational efficiency, offering a practical solution for democratizing advanced medical imaging technologies in resource-constrained environments. The code is available at https://github.com/hbyecoding/U-RWKV.
English: U-RWKV introduces a novel medical image segmentation framework using the RWKV architecture with DARM and SASE modules to efficiently capture long-range dependencies at O(N) computational cost, achieving state-of-the-art performance for equitable healthcare accessibility in resource-limited settings.
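The abstract does not spell out QuadScan; the sketch below shows one plausible reading, assuming it runs a 1D sequence operator (an RWKV block, abstracted here as `seq_op`) over four scan orders of the feature map and fuses the results to mitigate directional bias.

```python
import torch

def quad_scan(feat, seq_op):
    """Apply a 1D sequence operator along four scan orders of a (B, C, H, W) map
    (left-right, right-left, top-down, bottom-up) and average the results.
    `seq_op` maps (B, L, C) -> (B, L, C) and stands in for an RWKV block."""
    B, C, H, W = feat.shape
    outs = []
    h = feat.permute(0, 2, 3, 1).reshape(B, H * W, C)        # row-major order
    outs.append(seq_op(h).reshape(B, H, W, C))
    outs.append(seq_op(h.flip(1)).flip(1).reshape(B, H, W, C))
    v = feat.permute(0, 3, 2, 1).reshape(B, W * H, C)        # column-major order
    outs.append(seq_op(v).reshape(B, W, H, C).permute(0, 2, 1, 3))
    outs.append(seq_op(v.flip(1)).flip(1).reshape(B, W, H, C).permute(0, 2, 1, 3))
    out = torch.stack(outs).mean(0)                          # fuse the four directions
    return out.permute(0, 3, 1, 2)                           # back to (B, C, H, W)
```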

Authors:Pierrick Leroy, Antonio Mastropietro, Marco Nurisso, Francesco Vaccarino
Title: Attributes Shape the Embedding Space of Face Recognition Models
Abstract:
Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability. Code available here: https://github.com/mantonios107/attrs-fr-embs
English: This study introduces a geometric approach and a physics-inspired metric to analyze how face recognition models respond to interpretable facial and image attributes, revealing varying levels of invariance across attributes and enhancing model interpretability.

Authors:Jianfei Jiang, Qiankun Liu, Haochen Yu, Hongyuan Liu, Liyong Wang, Jiansheng Chen, Huimin Ma
Title: MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network
Abstract:
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
English: MonoMVSNet enhances Multi-View Stereo by integrating monocular depth and feature priors through attention mechanisms and dynamic depth candidate updates, achieving state-of-the-art results on benchmarks like DTU and Tanks-and-Temples.
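Since monocular depth is only defined up to an affine ambiguity, aligning it to the multi-view estimate is the natural first step before using it to update depth candidates; the least-squares scale-and-shift sketch below is an assumption about how such alignment could be done, not the paper's exact procedure.

```python
import torch

def align_scale_shift(mono_depth, mvs_depth, valid):
    """Solve for scale s and shift t so that s * mono + t ~= mvs on valid
    pixels (least squares), then return the aligned monocular depth map."""
    m, d = mono_depth[valid], mvs_depth[valid]
    A = torch.stack([m, torch.ones_like(m)], dim=1)        # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, d.unsqueeze(1)).solution   # (2, 1): [s, t]
    s, t = sol[0, 0], sol[1, 0]
    return s * mono_depth + t
```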

Authors:An-Lun Liu, Yu-Wei Chao, Yi-Ting Chen
Title: Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers
Abstract:
In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: https://hcis-lab.github.io/TOHGS/

Authors:Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He
Title: ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
Abstract:
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly on complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
English: ViewSRD introduces a structured multi-view decomposition framework that simplifies complex multi-anchor queries through perspective-aware description generation and cross-modal view integration, achieving state-of-the-art performance in 3D visual grounding.

Authors:Guanghao Wu, Chen Xu, Hai Song, Chong Wang, Qixing Zhang
Title: MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection
Abstract:
Smoke is the first visible indicator of a wildfire. With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we propose a comprehensive framework for generating forest fire smoke images. First, we employ a pre-trained segmentation model and a multimodal model to obtain smoke masks and image captions. Then, to address the insufficient utilization of masks and masked images by inpainting models, we introduce a network architecture guided by mask and masked-image features. We also propose a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask edges. Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporate smoke characteristics and use a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. Experiments show that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at https://github.com/wghr123/MFGDiffusion.
English: This study introduces a comprehensive framework for generating realistic forest fire smoke images using advanced diffusion models, addressing data scarcity and enhancing detection model performance through innovative mask-guided networks and loss functions.
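One way to realize the described random expansion and erosion of mask edges is max-pooling morphology: the band between the dilated and eroded masks weights the reconstruction error. The sketch below, including the L1 form, is an assumption based on the abstract rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def random_edge_band(mask, max_k=4):
    """Randomly dilate and erode a binary mask (B, 1, H, W) via max-pooling
    and return the band between the two as an edge-region weight map."""
    k = 2 * torch.randint(1, max_k + 1, (1,)).item() + 1   # random odd kernel
    dilated = F.max_pool2d(mask, k, stride=1, padding=k // 2)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=k // 2)
    return dilated - eroded                                # 1 on the edge band

def mask_edge_loss(pred, target, mask):
    band = random_edge_band(mask)
    return (band * (pred - target).abs()).sum() / band.sum().clamp(min=1.0)
```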

Authors:X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang
Title: NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models
Abstract:
With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to the expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as the Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using an MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

Authors:Zhifeng Gu, Bing Wang
Title: MMOne: Representing Multiple Modalities in One Scene
Abstract:
Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.
English Summary: The study introduces MMOne, a framework that addresses modality conflicts by modeling unique properties and decomposing multimodal information into shared and specific components, resulting in enhanced and scalable scene representation.

Authors:Hankun Liu, Yujian Zhao, Guanglin Niu
Title: Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID
Abstract:
Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model's robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model's discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.
English: The proposed HSGL framework innovatively integrates text and visual modalities to define, generate, and optimize hard samples through dual-granularity generation and adaptive learning, significantly enhancing model robustness and achieving state-of-the-art performance in clothing-changing person re-identification.

Authors:Weizhao Ma, Dong Zhou, Yuhui Hu, Zipeng He
Title: GKNet: Graph-based Keypoints Network for Monocular Pose Estimation of Non-cooperative Spacecraft
Abstract:
Monocular pose estimation of non-cooperative spacecraft is significant for on-orbit service (OOS) tasks, such as satellite maintenance, space debris removal, and station assembly. Considering the high demands on pose estimation accuracy, mainstream monocular pose estimation methods typically consist of a keypoint detector and a PnP solver. However, current keypoint detectors remain vulnerable to the structural symmetry and partial occlusion of non-cooperative spacecraft. To this end, we propose a graph-based keypoints network for the monocular pose estimation of non-cooperative spacecraft, GKNet, which leverages the geometric constraint of the keypoints graph. In order to better validate keypoint detectors, we present a moderate-scale dataset for spacecraft keypoint detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated images, and corresponding high-precision keypoint annotations. Extensive experiments and an ablation study have demonstrated the high accuracy and effectiveness of our GKNet, compared to the state-of-the-art spacecraft keypoint detectors. The code for GKNet and the SKD dataset is available at https://github.com/Dongzhou-1996/GKNet.
English: This paper introduces GKNet, a graph-based keypoint detection network that addresses challenges like structural symmetry and occlusion in monocular pose estimation of non-cooperative spacecraft, validated through a new dataset and outperforming existing methods.

Authors:Jeongyun Kim, Seunghoon Jeong, Giseop Kim, Myung-Hwan Jeon, Eunji Jun, Ayoung Kim
Title: TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
Abstract:
Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object-aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain-reaction movement of remaining objects without the need for rescanning. TRAN-D is evaluated on both synthetic and real-world sequences, and it consistently demonstrates robust improvements over existing GS-based state-of-the-art methods. Compared with baselines, TRAN-D reduces the mean absolute error by over 39% on the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, TRAN-D reaches a δ < 2.5 cm accuracy of 48.46%, over 1.5 times that of baselines, which use six images. Code and more results are available at https://jeongyun0609.github.io/TRAN-D/.

Authors:Hayeon Kim, Ji Ha Jang, Se Young Chun
Title: Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
Abstract:
Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and inherently ambiguous nature of Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label property, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, making it possible for more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.

Authors:Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang
Title: Efficient Dual-domain Image Dehazing with Haze Prior Perception
Abstract:
Transformer-based models exhibit strong global modeling capabilities in single-image dehazing, but their high computational cost limits real-time applicability. Existing methods predominantly rely on spatial-domain features to capture long-range dependencies, which are computationally expensive and often inadequate under complex haze conditions. While some approaches introduce frequency-domain cues, the weak coupling between spatial and frequency branches limits the overall performance. To overcome these limitations, we propose the Dark Channel Guided Frequency-aware Dehazing Network (DGFDNet), a novel dual-domain framework that performs physically guided degradation alignment across spatial and frequency domains. At its core, the DGFDBlock comprises two key modules: 1) the Haze-Aware Frequency Modulator (HAFM), which generates a pixel-level haze confidence map from dark channel priors to adaptively enhance haze-relevant frequency components, thereby achieving global degradation-aware spectral modulation; 2) the Multi-level Gating Aggregation Module (MGAM), which fuses multi-scale features through diverse convolutional kernels and hybrid gating mechanisms to recover fine structural details. Additionally, a Prior Correction Guidance Branch (PCGB) incorporates a closed-loop feedback mechanism, enabling iterative refinement of the prior by intermediate dehazed features and significantly improving haze localization accuracy, especially in challenging outdoor scenes. Extensive experiments on four benchmark haze datasets demonstrate that DGFDNet achieves state-of-the-art performance with superior robustness and real-time efficiency. Code is available at: https://github.com/Dilizlr/DGFDNet.
English Summary: The proposed DGFDNet introduces a dual-domain dehazing framework that integrates dark channel priors with frequency-aware modulation and multi-scale feature fusion, achieving state-of-the-art performance with real-time efficiency on benchmark datasets.
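HAFM's haze confidence map is built on the classic dark channel prior; the batched computation of that prior is sketched below (the learned modulation that DGFDNet applies on top of it is not reproduced).

```python
import torch
import torch.nn.functional as F

def dark_channel(img, patch=15):
    """Dark channel prior for RGB images in [0, 1], shape (B, 3, H, W):
    per-pixel channel minimum followed by a local minimum filter,
    implemented as max-pooling on the negated map."""
    mins = img.min(dim=1, keepdim=True).values             # (B, 1, H, W)
    return -F.max_pool2d(-mins, patch, stride=1, padding=patch // 2)

def haze_confidence(img):
    """Denser haze yields higher dark-channel values; clamp to [0, 1]."""
    return dark_channel(img).clamp(0.0, 1.0)
```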

Authors:Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu
Title: First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
Abstract:
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.
English Summary: FOEM introduces a novel post-training quantization method that incorporates first-order gradient terms to address accumulated deviations in weight calibration, significantly improving model performance with minimal computational overhead.
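A schematic sketch of how a first-order term can enter GPTQ-style column-wise compensation, with the gradient approximated by the latent-minus-full-precision difference as the abstract describes; the toy quantizer, the way the correction is folded in, and the precomputed `Hinv` are simplifying assumptions rather than FOEM's exact update.

```python
import torch

def quantize(w, scale):
    """Toy symmetric 4-bit quantizer; the method is agnostic to the quantizer."""
    return torch.round(w / scale).clamp(-8, 7) * scale

def first_order_calibration(W, Hinv, scale):
    """Column-wise quantization with error compensation. W holds the evolving
    latent weights; g ~= (latent - full-precision) is the cheap gradient proxy."""
    W_fp = W.clone()                        # frozen full-precision reference
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]      # standard second-order term
        g = W[:, j] - W_fp[:, j]                    # accumulated first-order deviation
        corr = err + g / Hinv[j, j]                 # fold in a first-order correction (schematic)
        W[:, j + 1:] -= corr.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q
```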

Authors:Yanbo Wang, Zipeng Fang, Lei Zhao, Weidong Chen
Title: Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation
Abstract:
Service robots are increasingly deployed in diverse and dynamic environments, where both physical layouts and social contexts change over time and across locations. In these unstructured settings, conventional navigation systems that rely on fixed parameters often fail to generalize across scenarios, resulting in degraded performance and reduced social acceptance. Although recent approaches have leveraged reinforcement learning to enhance traditional planners, these methods often fail in real-world deployments due to poor generalization and limited simulation diversity, which hampers effective sim-to-real transfer. To tackle these issues, we present LE-Nav, an interpretable and scene-aware navigation framework that leverages multi-modal large language model reasoning and conditional variational autoencoders to adaptively tune planner hyperparameters. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Additionally, a conditional variational autoencoder captures the mapping between natural language instructions and navigation hyperparameters, enabling expert-level tuning. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios. Real-world navigation trials and a user study on a smart wheelchair platform demonstrate that it outperforms state-of-the-art methods on quantitative metrics such as success rate, efficiency, safety, and comfort, while receiving higher subjective scores for perceived safety and social acceptance. Code is available at https://github.com/Cavendish518/LE-Nav.
English: LE-Nav is an interpretable navigation framework that uses multi-modal reasoning and conditional variational autoencoders to dynamically tune planner parameters, achieving human-level performance and outperforming existing methods in real-world trials.

Authors:Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See
Title: SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition
Abstract:
The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results on ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at https://github.com/henry-pay/SpaRTAN.
English: SpaRTAN is a lightweight CNN architecture that enhances spatial and channel-wise feature processing through multi-scale kernels and wave-based aggregation, achieving competitive performance with high parameter efficiency on ImageNet and COCO benchmarks.
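A compact sketch of the multi-order spatial idea: depthwise kernels with growing receptive fields (kernel size and dilation), fused by a pointwise projection. The branch configuration is an assumption, and the wave-based channel aggregation module is not reproduced here.

```python
import torch
import torch.nn as nn

class MultiOrderSpatial(nn.Module):
    """Parallel depthwise convolutions with varying kernel size and dilation,
    concatenated and fused by a 1x1 projection (a sketch, not the exact block)."""
    def __init__(self, dim, specs=((3, 1), (5, 2), (7, 3))):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=d * (k - 1) // 2, dilation=d, groups=dim)
            for k, d in specs
        ])
        self.proj = nn.Conv2d(dim * len(specs), dim, 1)

    def forward(self, x):
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))
```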

Authors:Ayush Gupta, Siyuan Huang, Rama Chellappa
Title: Mind the Gap: Bridging Occlusion in Gait Recognition via Residual Gap Correction
Abstract:
Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention. We release our code publicly at https://github.com/Ayush-00/rg-gait.
English Summary: RG-Gait introduces a residual correction method that models occluded gait as deviations from holistic representations, enhancing occlusion recognition without sacrificing performance on unoccluded inputs.
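The residual view suggests a simple functional form: predict an occlusion residual from the feature and add it through a learned gate, so holistic inputs can pass through nearly unchanged when the gate stays near zero. The module below is a minimal illustration under that assumption, not the paper's network.

```python
import torch
import torch.nn as nn

class ResidualGapCorrection(nn.Module):
    """Gated residual correction: feat + gate(feat) * residual(feat)."""
    def __init__(self, dim):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat):
        # The gate adaptively decides how much residual to integrate,
        # ideally staying small for holistic (unoccluded) inputs.
        return feat + self.gate(feat) * self.residual(feat)
```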

Authors:Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See
Title: Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection
Abstract:
Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at https://github.com/henry-pay/RayEncoder.
English: The study introduces a wavelet attention-like backbone and a ray-based encoder to enhance human-object interaction detection by efficiently aggregating multi-scale features and optimizing computational focus, achieving promising results on benchmark datasets.

Authors:Shaowen Tong, Zimin Xia, Alexandre Alahi, Xuming He, Yujiao Shi
Title: GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
Abstract:
Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a geometry-guided, weakly supervised self-distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited-FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited-FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Code and model can be found at https://github.com/tongshw/GeoDistill.
English: GeoDistill introduces a weakly supervised self-distillation framework that enhances cross-view localization by aligning teacher-student predictions with FoV-based masking, improving accuracy and reducing uncertainty without costly annotations.

Authors:Roman Naeem, David Hagerman, Jennifer Alvén, Lennart Svensson, Fredrik Kahl
Title: Trexplorer Super: Topologically Correct Centerline Tree Tracking of Tubular Objects in CT Volumes
Abstract:
Tubular tree structures, such as blood vessels and airways, are essential in human anatomy and accurately tracking them while preserving their topology is crucial for various downstream tasks. Trexplorer is a recurrent model designed for centerline tracking in 3D medical images but it struggles with predicting duplicate branches and terminating tracking prematurely. To address these issues, we present Trexplorer Super, an enhanced version that notably improves performance through novel advancements. However, evaluating centerline tracking models is challenging due to the lack of public datasets. To enable thorough evaluation, we develop three centerline datasets, one synthetic and two real, each with increasing difficulty. Using these datasets, we conduct a comprehensive evaluation of existing state-of-the-art (SOTA) models and compare them with our approach. Trexplorer Super outperforms previous SOTA models on every dataset. Our results also highlight that strong performance on synthetic data does not necessarily translate to real datasets. The code and datasets are available at https://github.com/RomStriker/Trexplorer-Super.
English: Trexplorer Super is an enhanced model that outperforms existing state-of-the-art methods in centerline tracking for tubular structures in 3D medical images, with evaluations conducted on newly developed synthetic and real datasets.

Authors:Chetan Madan, Aarjav Satia, Soumen Basu, Pankaj Gupta, Usha Dutta, Chetan Arora
Title: Focus on Texture: Rethinking Pre-training in Masked Autoencoders for Medical Image Classification
Abstract:
Masked Autoencoders (MAEs) have emerged as a dominant strategy for self-supervised representation learning in natural images, where models are pre-trained to reconstruct masked patches, with a pixel-wise mean squared error (MSE) between the original and reconstructed RGB values as the loss. We observe that MSE encourages blurred image reconstruction but still works for natural images, as it preserves dominant edges. However, in medical imaging, where texture cues are more important for classifying a visual abnormality, the strategy fails. Taking inspiration from the Gray Level Co-occurrence Matrix (GLCM) feature in radiomics studies, we propose a novel MAE-based pre-training framework, GLCM-MAE, using a reconstruction loss based on matching GLCMs. GLCM captures intensity and spatial relationships in an image, hence the proposed loss helps preserve morphological features. Further, we propose a novel formulation to convert matching GLCM matrices into a differentiable loss function. We demonstrate that unsupervised pre-training on medical images with the proposed GLCM loss improves representations for downstream tasks. GLCM-MAE outperforms the current state-of-the-art across four tasks - gallbladder cancer detection from ultrasound images by 2.1%, breast cancer detection from ultrasound by 3.1%, pneumonia detection from x-rays by 0.5%, and COVID detection from CT by 0.6%. Source code and pre-trained models are available at: https://github.com/ChetanMadan/GLCM-MAE.
English: Masked Autoencoders (MAEs) are enhanced with a novel GLCM-based reconstruction loss that preserves critical texture features in medical images, leading to improved performance in detecting diseases like cancer and pneumonia across various imaging modalities.
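The key step is making GLCM matching differentiable. One standard route, sketched below, soft-assigns intensities to gray levels with a Gaussian kernel and accumulates co-occurrences with a fixed neighbor offset; the bin count, the single horizontal offset, and the L1 comparison are illustrative assumptions, not necessarily the paper's formulation.

```python
import torch

def soft_glcm(img, levels=16, sigma=0.05):
    """Differentiable GLCM for grayscale images in [0, 1], shape (B, 1, H, W):
    soft-assign each pixel to intensity bins, then accumulate co-occurrences
    with the right-hand neighbor via a batched outer product."""
    B, _, H, W = img.shape
    centers = torch.linspace(0, 1, levels, device=img.device)
    a = torch.exp(-((img - centers.view(1, levels, 1, 1)) ** 2) / (2 * sigma ** 2))
    a = a / a.sum(dim=1, keepdim=True).clamp(min=1e-8)     # (B, levels, H, W)
    left = a[..., :-1].reshape(B, levels, -1)              # each pixel
    right = a[..., 1:].reshape(B, levels, -1)              # its right neighbor
    glcm = torch.einsum('bip,bjp->bij', left, right)
    return glcm / glcm.sum(dim=(1, 2), keepdim=True).clamp(min=1e-8)

def glcm_loss(recon, target):
    return (soft_glcm(recon) - soft_glcm(target)).abs().mean()
```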

Authors:Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel
Title: ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Abstract:
Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent nested Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT initiates inference by activating a small subset of the most important attention heads and terminates early if predictions reach sufficient certainty. Otherwise, it activates additional attention heads and re-evaluates the input. At the core of ThinkingViT is our Token Recycling mechanism, which conditions each subsequent inference stage on the embeddings from the previous stage, enabling progressive improvement. Due to its backbone-preserving design, ThinkingViT also serves as a plugin upgrade for vanilla ViT. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. The source code is available at https://github.com/ds-kiel/ThinkingViT.
English: ThinkingViT introduces a nested Vision Transformer architecture that dynamically adjusts computational resources based on input complexity, achieving superior accuracy and efficiency through progressive thinking stages and token recycling.
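The progressive-thinking control flow reduces to a confidence-gated early-exit loop. The sketch below assumes each stage returns logits plus embeddings that condition the next stage (the Token Recycling idea) and a batch size of one; the stage interface is an assumption, not the released API.

```python
import torch

@torch.no_grad()
def progressive_inference(stages, x, threshold=0.9):
    """Run nested stages of growing width; stop once softmax confidence
    clears the threshold. `stages` is a list of callables mapping
    (image, prev_embeddings) -> (logits, embeddings)."""
    embed = None
    logits = None
    for stage in stages:
        logits, embed = stage(x, embed)        # later stages recycle embeddings
        conf = logits.softmax(dim=-1).max(dim=-1).values
        if conf.item() >= threshold:           # confident enough: exit early
            break
    return logits
```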

Authors:Hsiang-Wei Huang, Jen-Hao Cheng, Kuang-Ming Chen, Cheng-Yen Yang, Bahaa Alattar, Yi-Ru Lin, Pyongkun Kim, Sangwon Kim, Kwangju Kim, Chung-I Huang, Jenq-Neng Hwang
Title: Warehouse Spatial Question Answering with LLM Agent
Abstract:
Spatial understanding has been a challenging task for existing Multi-modal Large Language Models (MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance spatial understanding ability. In this paper, we present a data-efficient approach. We propose an LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and interact with API tools to answer complicated spatial questions. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: https://github.com/hsiangwei0903/SpatialAgent
English: This paper introduces a data-efficient LLM agent system with advanced spatial reasoning capabilities, utilizing multiple tools to accurately solve complex spatial tasks in warehouse environments, as demonstrated by high performance on the AI City Challenge dataset.

Authors:Jeffrey Joan Sam, Janhavi Sathe, Nikhil Chigali, Naman Gupta, Radhey Ruparel, Yicheng Jiang, Janmajay Singh, James W. Berck, Arko Barman
Title: A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers
Abstract:
Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA's TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. Finally, we finetuned YOLOv8 and YOLOv11 segmentation models to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints, mimicking real-world image segmentation challenges for real-time onboard applications on NASA's inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, a Hausdorff distance of 0.69, and an inference time of about 0.5 seconds. The dataset and models for the performance benchmark are available at https://github.com/RiceD2KLab/SWiM.
English: A new dataset of nearly 64,000 annotated spacecraft images has been developed to address the scarcity of training data for autonomous inspection systems, with fine-tuned YOLO models achieving high accuracy and real-time performance under simulated space conditions.

Authors:Tongshun Zhang, Pingping Liu, Yubing Lu, Mengen Cai, Zijian Zhang, Zhe Zhang, Qiuzhan Zhou
Title: CWNet: Causal Wavelet Network for Low-Light Image Enhancement
Abstract:
Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that effectively optimizes the recovery of frequency information, ensuring precise enhancement tailored to the specific attributes of wavelet transforms. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes. Code is available at https://github.com/bywlzts/CWNet-Causal-Wavelet-Network.
English: CWNet introduces a causal reasoning approach and wavelet transform-based architecture to address the limitations of traditional low-light image enhancement by preserving semantic consistency and optimizing frequency recovery, achieving superior performance across diverse datasets.

Authors:Ruixi Zheng, Wei Zhang, Yijie Li, Xi Zhu, Zhou Lan, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Lauren J. O'Donnell, Fan Zhang
Title: AGFS-Tractometry: A Novel Atlas-Guided Fine-Scale Tractometry Approach for Enhanced Along-Tract Group Statistical Comparison Using Diffusion MRI Tractography
Abstract:
Diffusion MRI (dMRI) tractography is currently the only method for in vivo mapping of the brain's white matter (WM) connections. Tractometry is an advanced tractography analysis technique for along-tract profiling to investigate the morphology and microstructural properties along the fiber tracts. Tractometry has become an essential tool for studying local along-tract differences between different populations (e.g., health vs. disease). In this study, we propose a novel atlas-guided fine-scale tractometry method, namely AGFS-Tractometry, that leverages tract spatial information and permutation testing to enhance the along-tract statistical analysis between populations. There are two major contributions in AGFS-Tractometry. First, we create a novel atlas-guided tract profiling template that enables consistent, fine-scale, along-tract parcellation of subject-specific fiber tracts. Second, we propose a novel nonparametric permutation testing group comparison method to enable simultaneous analysis across all along-tract parcels while correcting for multiple comparisons. We perform experimental evaluations on synthetic datasets with known group differences and in vivo real data. We compare AGFS-Tractometry with two state-of-the-art tractometry methods, including Automated Fiber-tract Quantification (AFQ) and BUndle ANalytics (BUAN). Our results show that the proposed AGFS-Tractometry obtains enhanced sensitivity and specificity in detecting local WM differences. In the real data analysis experiments, AGFS-Tractometry can identify more regions with significant differences, which are anatomically consistent with the existing literature. Overall, these results demonstrate the ability of AGFS-Tractometry to detect subtle or spatially localized WM group-level differences. The created tract profiling template and related code are available at: https://github.com/ZhengRuixi/AGFS-Tractometry.git.
English: This study introduces AGFS-Tractometry, a novel atlas-guided method that enhances the detection of local white matter differences in diffusion MRI tractography through fine-scale parcellation and advanced statistical testing, demonstrating superior sensitivity and specificity compared to existing techniques.
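The simultaneous along-tract comparison can be illustrated with a max-statistic permutation test, which controls the family-wise error rate across parcels; the group mean difference used as the test statistic below is an assumption, not necessarily the paper's exact statistic.

```python
import numpy as np

def maxstat_permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """group_a (Na, P) and group_b (Nb, P) hold one value per subject per
    along-tract parcel; returns one FWER-corrected p-value per parcel."""
    rng = np.random.default_rng(seed)
    data = np.vstack([group_a, group_b])
    na = len(group_a)
    observed = np.abs(group_a.mean(0) - group_b.mean(0))   # (P,)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(data))                  # relabel subjects
        diff = np.abs(data[perm[:na]].mean(0) - data[perm[na:]].mean(0))
        max_null[i] = diff.max()       # max over parcels corrects for multiplicity
    return (max_null[None, :] >= observed[:, None]).mean(axis=1)
```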

Authors:Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
Title: EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Abstract:
Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

Authors:Shivangi Aneja, Sebastian Weiss, Irene Baeza, Prashanth Chandran, Gaspard Zoss, Matthias Nießner, Derek Bradley
Title: ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions
Abstract:
Generating high-fidelity real-time animated sequences of photorealistic 3D head avatars is important for many graphics applications, including immersive telepresence and movies. This is a challenging problem, particularly when rendering digital avatar close-ups that show a character's facial microfeatures and expressions. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple locally-defined facial expressions with 3D Gaussian splatting to enable creating ultra-high fidelity, expressive and photorealistic 3D head avatars. In contrast to previous works that operate on a global expression space, we condition our avatar's dynamics on patch-based local expression features and synthesize 3D Gaussians at a patch level. In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion by coupling the patches with anchor points of Scaffold-GS, a recent hierarchical scene representation. These anchors are then used to synthesize 3D Gaussians on-the-fly, conditioned by patch-expressions and viewing direction. We employ color-based densification and progressive training to obtain high-quality results and faster convergence for high resolution 3K training images. By leveraging patch-level expressions, ScaffoldAvatar consistently achieves state-of-the-art performance with visually natural motion, while encompassing diverse facial expressions and styles in real time.

Authors:Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin
Title: BenchReAD: A systematic benchmark for retinal anomaly detection
Abstract:
Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.
English: This abstract introduces a comprehensive benchmark for retinal anomaly detection to address the limitations of previous methods, proposing a novel approach called NFM-DRA that integrates disentangled representations of abnormalities with a normal feature memory to achieve state-of-the-art performance.

Authors:Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li, Renjing Pei, Xiaoming Li, Rynson W. H. Lau, Wangmeng Zuo
Title: RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction
Abstract:
Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotated masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at https://github.com/yinzhicun/RefSTAR.
English Summary: This paper introduces RefSTAR, a novel blind facial image restoration method that effectively incorporates features from high-quality reference images through specialized selection, transfer, and reconstruction mechanisms to better preserve identity and transfer quality.

Authors:Shanshan Zhong, Jiawei Peng, Zehan Zheng, Zhongzhan Huang, Wufei Ma, Guofeng Zhang, Qihao Liu, Alan Yuille, Jieneng Chen
Title: 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
Abstract:
Existing methods for reconstructing animatable 3D animals from videos typically rely on sparse semantic keypoints to fit parametric models. However, obtaining such keypoints is labor-intensive, and keypoint detectors trained on limited animal data are often unreliable. To address this, we propose 4D-Animal, a novel framework that reconstructs animatable 3D animals from videos without requiring sparse keypoint annotations. Our approach introduces a dense feature network that maps 2D representations to SMAL parameters, enhancing both the efficiency and stability of the fitting process. Furthermore, we develop a hierarchical alignment strategy that integrates silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D visual models to produce accurate and temporally coherent reconstructions across frames. Extensive experiments demonstrate that 4D-Animal outperforms both model-based and model-free baselines. Moreover, the high-quality 3D assets generated by our method can benefit other 3D tasks, underscoring its potential for large-scale applications. The code is released at https://github.com/zhongshsh/4D-Animal.
English: The 4D-Animal framework reconstructs animatable 3D animals from videos without sparse keypoint annotations by using a dense feature network and hierarchical alignment strategy, outperforming existing methods and generating high-quality assets for other 3D tasks.

Authors:Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, Yanning Zhang
Title: Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
Abstract:
With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of AGI quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missing detail perception. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules. The Text-assisted Semantic Alignment Module (TSAM) leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check. The Frequency-domain Fine-Grained Degradation Perception Module (FFDPM) draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
English: The proposed SC-AGIQA framework addresses semantic misalignment and detail perception issues in AI-generated image assessment by integrating a Text-assisted Semantic Alignment Module for improved text-image consistency and a Frequency-domain Fine-Grained Degradation Perception Module for enhanced visual distortion quantification, demonstrating superior performance over existing methods.

Authors:Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash
Title: Test-Time Canonicalization by Foundation Models for Robust Perception
Abstract:
Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FOCAL, a test-time robustness framework that transforms the input into the most typical view. At inference time, FOCAL explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a test-time optimization problem, FOCAL offers a general and scalable approach to robustness. Our code is available at: https://github.com/sutkarsh/focal.
English: FOCAL is a test-time robustness framework that transforms inputs into optimal views using foundation model priors, enhancing adaptability to various transformations without retraining.
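To make the test-time canonicalization idea more concrete, here is a minimal sketch assuming a hypothetical `prior_score` callable (standing in for a foundation-model "typicality" score such as a likelihood or CLIP similarity) and a hypothetical downstream `model`; the transform set and scorer are illustrative, not the paper's implementation.

```python
import numpy as np

def canonicalize(image, transforms, prior_score):
    """Pick the transformed view that looks most 'typical' under a prior.
    `transforms` are callables image -> image; `prior_score` is a hypothetical
    scorer (higher = more typical), e.g. a foundation-model likelihood."""
    candidates = [t(image) for t in transforms]
    scores = [prior_score(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

def robust_predict(image, transforms, prior_score, model):
    # The unchanged downstream model runs on the canonicalized view
    return model(canonicalize(image, transforms, prior_score))

# Toy usage: 2D rotations of an array image by multiples of 90 degrees
rots = [lambda x, k=k: np.rot90(x, k) for k in range(4)]
img = np.random.rand(32, 32, 3)
best = canonicalize(img, rots, prior_score=lambda x: -x.var())  # stand-in scorer
print(best.shape)
```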

Authors:Jiahe Zhao, Rongkun Zheng, Yi Wang, Helin Wang, Hengshuang Zhao
Title: DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
Abstract:
In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, which assigns unique semantics to visual tokens by pairing them with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, which ensures consistent temporal focus of visual tokens on video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness. The code is available at https://github.com/ZJHTerry18/DisCo.
English Summary: DisCo is a novel visual encapsulation method for video MLLMs that enhances semantic distinctness and temporal coherence through its Visual Concept Discriminator and Temporal Focus Calibrator components, achieving superior performance on video understanding benchmarks with higher token efficiency.

Authors:Muyi Bao, Changyu Zeng, Yifan Wang, Zhengni Yang, Zimu Wang, Guangliang Cheng, Jun Qi, Wei Wang
Title: FTCFormer: Fuzzy Token Clustering Transformer for Image Classification
Abstract:
Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, resulting in suboptimal feature representations. To address this issue, we propose Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on the semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more to represent semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline, with gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.
English: The FTCFormer introduces a clustering-based downsampling module that dynamically generates vision tokens based on semantic importance rather than spatial grids, enhancing feature representation and achieving consistent performance gains across 32 diverse image datasets.

Authors:Xinlong Ding, Hongwei Yu, Jiawei Li, Feifan Li, Yu Shang, Bochao Zou, Huimin Ma, Jiansheng Chen
Title: Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures
Abstract:
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to a significant enhancement in attack effectiveness. Experimental results show that optimized adversarial kaleidoscopic backgrounds can effectively attack various camera pose estimation models.
English Summary: The paper introduces the Kaleidoscopic Background Attack (KBA), which uses symmetrical texture discs to deceive camera pose estimation models by exploiting background interference, and enhances attack effectiveness through a novel consistency loss optimization.
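To illustrate what "multi-fold radial symmetry" means here, the small sketch below tiles one angular segment around a disc so the resulting texture looks similar under in-plane rotation; the segment function and fold count are made-up placeholders, and the actual attack additionally optimizes the segments with the proposed consistency loss.

```python
import numpy as np

def kaleidoscopic_disc(segment_fn, size=256, folds=8):
    """Build a disc texture with k-fold radial symmetry by tiling one angular
    segment (segment_fn maps (radius, angle-within-segment) -> intensity)."""
    ys, xs = np.mgrid[:size, :size]
    cx = cy = (size - 1) / 2.0
    r = np.hypot(xs - cx, ys - cy) / (size / 2.0)         # normalized radius
    theta = np.mod(np.arctan2(ys - cy, xs - cx), 2 * np.pi)
    wedge = 2 * np.pi / folds
    local = np.mod(theta, wedge)                           # angle inside one segment
    img = segment_fn(r, local)
    img[r > 1.0] = 0.0                                     # keep only the disc
    return img

# Toy segment: radial stripes; any natural texture patch could be sampled instead
disc = kaleidoscopic_disc(lambda r, a: 0.5 + 0.5 * np.cos(20 * r + 6 * a), folds=8)
print(disc.shape, disc.min(), disc.max())
```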

Authors:Jinglun Li, Kaixun Jiang, Zhaoyu Chen, Bo Lin, Yao Tang, Weifeng Ge, Wenqiang Zhang
Title: Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
Abstract:
Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, and the code is available at https://github.com/Jarvisgivemeasuit/SynOOD.
English: SynOOD leverages foundation models to generate synthetic boundary OOD samples through iterative in-painting and noise refinement, enhancing CLIP's discrimination between in-distribution and out-of-distribution data while achieving state-of-the-art performance on ImageNet with minimal computational overhead.

Authors:Xiangyu Yin, Boyuan Yang, Weichen Liu, Qiyao Xue, Abrar Alamri, Goeran Fiedler, Wei Gao
Title: ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
Abstract:
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses, due to their unique appearance and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks. Our code is available at https://github.com/pittisl/ProGait and dataset at https://huggingface.co/datasets/ericyxy98/ProGait.
English: This paper introduces the ProGait dataset to address challenges in vision-based gait analysis for prosthetic legs, providing video data and benchmarks to improve detection and analysis of prosthesis-specific movements.

Authors:Shicai Wei, Chunbo Luo, Yang Luo
Title: Boosting Multimodal Learning via Disentangled Gradient Learning
Abstract:
Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types of modalities, tasks, and frameworks with dense cross-modal interaction demonstrate the effectiveness and versatility of the proposed DGL. Code is available at https://github.com/shicaiwei123/ICCV2025-GDL
English: This study identifies optimization conflicts between modality encoders and fusion modules in multimodal learning, proposing a Disentangled Gradient Learning (DGL) framework to decouple their training and enhance performance across diverse tasks and modalities.
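A rough PyTorch-style sketch of the gradient decoupling described above, under the assumption of separate unimodal heads: detaching the encoder features before fusion truncates the multimodal gradient to the encoders, while the unimodal losses never reach the fusion module. The module names (`enc_a`, `fusion`, etc.) are hypothetical, and this is only one plausible realization of the idea.

```python
import torch
import torch.nn.functional as F

def dgl_step(x_a, x_b, y, enc_a, enc_b, head_a, head_b, fusion, optimizer):
    """One training step with decoupled encoder/fusion optimization."""
    f_a, f_b = enc_a(x_a), enc_b(x_b)
    # Unimodal losses drive the encoders (standing in for the truncated multimodal gradient)
    loss_uni = F.cross_entropy(head_a(f_a), y) + F.cross_entropy(head_b(f_b), y)
    # detach() stops the multimodal loss from back-propagating into the encoders,
    # so it only optimizes the fusion module
    loss_multi = F.cross_entropy(fusion(f_a.detach(), f_b.detach()), y)
    optimizer.zero_grad()
    (loss_uni + loss_multi).backward()
    optimizer.step()
    return loss_uni.item(), loss_multi.item()
```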

Authors:Huai-Qian Khor, Yante Li, Xingxun Jiang, Guoying Zhao
Title: Is Micro-expression Ethnic Leaning?
Abstract:
How much does ethnicity play its part in emotional expression? Emotional expression and micro-expression research probe into understanding human psychological responses to emotional stimuli, thereby revealing substantial hidden yet authentic emotions that can be useful in the event of diagnosis and interviews. While increased attention has been paid to micro-expression analysis, most studies have been conducted under Ekman's assumption of emotion universality, where emotional expressions are identical across cultures and social contexts. Our computational study uncovers some of the influences of ethnic background in expression analysis, leading to an argument that the emotional universality hypothesis is an overgeneralization from the perspective of manual psychological analysis. In this research, we propose to investigate the level of influence of ethnicity in a simulated micro-expression scenario. We construct a cross-cultural micro-expression database and algorithmically annotate the ethnic labels to facilitate the investigation. With the ethnically annotated dataset, we perform a prima facie study to compare mono-ethnicity and stereo-ethnicity in a controlled environment, which experimentally uncovers a certain influence of ethnic bias. Building on this finding, we propose a framework that integrates ethnic context into the emotional feature learning process, yielding an ethnically aware framework that recognises ethnicity differences in micro-expression recognition. For improved understanding, qualitative analyses have been done to solidify the preliminary investigation into this new realm of research. Code is publicly available at https://github.com/IcedDoggie/ICMEW2025_EthnicMER
English: This study challenges the universality of emotional expressions by demonstrating ethnic influences in micro-expression analysis and proposes an ethnically-aware framework to improve recognition accuracy.

Authors:Shicai Wei, Chunbo Luo, Yang Luo
Title: Improving Multimodal Learning via Imbalanced Learning
Abstract:
Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that imbalanced dependency on each modality obeying the inverse ratio of their variances contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning (ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus, the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL. Code is available at https://github.com/shicaiwei123/ICCV2025-ARL
English Summary: This paper challenges the conventional gradient balancing approach in multimodal learning by proposing Asymmetric Representation Learning (ARL), which achieves optimal performance through imbalanced modality optimization inversely proportional to their variances, validated across multiple datasets without adding parameters.
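A toy illustration of the inverse-variance weighting rule stated above: the dependence (loss weight) placed on each modality is set inversely proportional to its prediction variance, so the noisier modality contributes less. The normalization and toy numbers are assumptions, not the paper's exact coefficients.

```python
import torch

def arl_weights(var_a: torch.Tensor, var_b: torch.Tensor):
    # Weight ratio follows w_a / w_b = var_b / var_a, normalized to sum to 1
    inv = torch.stack([1.0 / var_a, 1.0 / var_b])
    return inv / inv.sum()

# Toy usage: the modality with higher prediction variance gets the smaller weight
w = arl_weights(torch.tensor(0.8), torch.tensor(0.2))
print(w)  # tensor([0.2000, 0.8000])
```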

Authors:Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seonjoo Kim
Title: A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding, reasoning, and generation. However, they struggle with tasks requiring fine-grained localization and reasoning in high-resolution images. This constraint stems from the fact that MLLMs are fine-tuned at a fixed image resolution to align with the pre-trained image encoder used in the MLLM. Consequently, feeding high-resolution images directly into MLLMs leads to poor generalization due to a train-test resolution discrepancy, while downsampling these images, although ensuring consistency, compromises fine-grained visual details and ultimately degrades performance. To address this challenge, we propose Extract Candidate then Predict (ECP), a novel training-free, task-agnostic two-stage framework designed to enhance MLLM performance on high-resolution images. The key intuition behind ECP is that while MLLMs struggle with high-resolution images, their predictions on downsampled images still contain implicit localization cues. By first identifying a candidate region using the coarse prediction and then predicting the final output based on that region, ECP effectively preserves fine-grained details while mitigating the challenges posed by high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K MLLM perception, achieving +21.3%, +5.8%, and +5.2% absolute improvements over the baseline, respectively, demonstrating its effectiveness. Code is available at https://github.com/yenncye/ECP.
English: The proposed Extract Candidate then Predict (ECP) framework enhances Multimodal Large Language Models' performance on high-resolution images by leveraging coarse predictions to identify candidate regions, preserving fine-grained details without requiring additional training.
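A minimal sketch of a two-stage extract-candidate-then-predict flow of this kind, assuming hypothetical `mllm_predict_box` and `mllm_predict` callables; the downsampling target, context margin, and coordinate handling are illustrative choices, not the paper's exact pipeline.

```python
from PIL import Image

def extract_candidate_then_predict(path, query, mllm_predict_box, mllm_predict,
                                   target=1024, context=1.5):
    full = Image.open(path).convert("RGB")
    W, H = full.size
    # Stage 1: a coarse prediction on the downsampled image carries implicit
    # localization cues, even if the fine-grained answer is unreliable
    scale = target / max(W, H)
    small = full.resize((int(W * scale), int(H * scale)))
    x0, y0, x1, y1 = mllm_predict_box(small, query)        # box in small-image coords
    # Map the candidate region back to full resolution with extra context
    cx, cy = (x0 + x1) / (2 * scale), (y0 + y1) / (2 * scale)
    w, h = (x1 - x0) * context / scale, (y1 - y0) * context / scale
    box = (int(max(0, cx - w / 2)), int(max(0, cy - h / 2)),
           int(min(W, cx + w / 2)), int(min(H, cy + h / 2)))
    # Stage 2: predict the final answer from the high-resolution crop
    return mllm_predict(full.crop(box), query)
```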

Authors:Shuyu Yang, Yaxiong Wang, Yongrui Li, Li Zhu, Zhedong Zheng
Title: Minimizing the Pretraining Gap: Domain-aligned Text-Based Person Retrieval
Abstract:
In this work, we focus on text-based person retrieval, which aims to identify individuals based on textual descriptions. Given the significant privacy issues and the high cost associated with manual annotation, synthetic data has become a popular choice for pretraining models, leading to notable advancements. However, the considerable domain gap between synthetic pretraining datasets and real-world target datasets, characterized by differences in lighting, color, and viewpoint, remains a critical obstacle that hinders the effectiveness of the pretrain-finetune paradigm. To bridge this gap, we introduce a unified text-based person retrieval pipeline considering domain adaptation at both image and region levels. In particular, it contains two primary components, i.e., Domain-aware Diffusion (DaD) for image-level adaptation and Multi-granularity Relation Alignment (MRA) for region-level adaptation. As the name implies, Domain-aware Diffusion is to migrate the distribution of images from the pretraining dataset domain to the target real-world dataset domain, e.g., CUHK-PEDES. Subsequently, MRA performs a meticulous region-level alignment by establishing correspondences between visual regions and their descriptive sentences, thereby addressing disparities at a finer granularity. Extensive experiments show that our dual-level adaptation method has achieved state-of-the-art results on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, outperforming existing methodologies. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/MRA.
English: This study introduces a unified text-based person retrieval pipeline that addresses the domain gap between synthetic pretraining data and real-world datasets through dual-level domain adaptation, achieving state-of-the-art performance on benchmark datasets.

Authors:Ivan Martinović, Josip Šarić, Marin Oršić, Matej Kristan, Siniša Šegvić
Title: DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation
Abstract:
Pixel-level annotation is expensive and time-consuming. Semi-supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi-supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask-transformer consistency with zero-shot classification of CLIP features. We enhance localization by class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi-supervised scenarios with large taxonomies and limited labeled data. Moreover, DEARLi outperforms the state of the art in semi-supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at https://github.com/helen1c/DEARLi.
English: DEARLi introduces a novel semi-supervised panoptic segmentation method that enhances recognition and localization separately using foundation models, achieving state-of-the-art results with significantly less GPU memory and excelling in scenarios with limited labeled data.

Authors:Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu
Title: FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
Abstract:
CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.
English: FIX-CLIP enhances CLIP's capability for long-text tasks through dual-branch training, regional prompts, and hierarchical feature alignment, achieving state-of-the-art performance in both long and short-text retrieval.

Authors:Meng Yu, Kun Zhan
Title: Frequency Regulation for Exposure Bias Mitigation in Diffusion Models
Abstract:
Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of predicted noisy samples in the reverse process continuously declines compared to perturbed samples in the forward process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) The subband energy of reverse-process reconstructed samples is consistently lower than that of forward-process ones, and both are lower than the original data samples. Based on the first finding, we introduce a dynamic frequency regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we derive the rigorous mathematical form of exposure bias. It is worth noting that, our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and frameworks with negligible computational cost. The source code is available at https://github.com/kunzhan/wpp.
English Summary: This paper introduces a training-free, plug-and-play method that uses wavelet transforms to dynamically regulate frequency subbands, significantly improving diffusion model performance by addressing exposure bias with minimal computational cost.
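To show what "separately adjusting the low- and high-frequency subbands" could look like in practice, here is a tiny sketch using the PyWavelets package that decomposes a 2D sample, rescales the subbands with separate gains, and reconstructs it; the fixed gain values are made-up, whereas the paper's regulation mechanism is dynamic.

```python
import numpy as np
import pywt

def regulate_subbands(x, low_gain=1.02, high_gain=1.05, wavelet="haar"):
    """Illustrative wavelet-domain regulation of a 2D sample (gains are placeholders)."""
    cA, (cH, cV, cD) = pywt.dwt2(x, wavelet)               # low- and high-frequency subbands
    cA = cA * low_gain                                      # compensate low-frequency energy
    cH, cV, cD = (c * high_gain for c in (cH, cV, cD))      # compensate high-frequency energy
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)

x = np.random.rand(64, 64)
print(regulate_subbands(x).shape)  # (64, 64)
```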

Authors:Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Honglei Yan, Katerina Fragkiadaki, Yadong Mu
Title: MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
Abstract:
We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising their time-varying motion. This allows, for the first time, the unified modeling of appearance, geometry and motion, and enables view synthesis, reconstruction and 3D point tracking within a single learning-based framework. By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.

Authors:Xianghong Zou, Jianping Li, Zhe Chen, Zhen Cao, Zhen Dong, Qiegen Liu, Bisheng Yang
Title: LifelongPR: Lifelong point cloud place recognition based on sample replay and prompt learning
Abstract:
Point cloud place recognition (PCPR) determines the geo-location within a prebuilt map and plays a crucial role in geoscience and robotics applications such as autonomous driving, intelligent transportation, and augmented reality. In real-world large-scale deployments of a geographic positioning system, PCPR models must continuously acquire, update, and accumulate knowledge to adapt to diverse and dynamic environments, i.e., the ability known as continual learning (CL). However, existing PCPR models often suffer from catastrophic forgetting, leading to significant performance degradation in previously learned scenes when adapting to new environments or sensor types. This results in poor model scalability, increased maintenance costs, and system deployment difficulties, undermining the practicality of PCPR. To address these issues, we propose LifelongPR, a novel continual learning framework for PCPR, which effectively extracts and fuses knowledge from sequential point cloud data. First, to alleviate the knowledge loss, we propose a replay sample selection method that dynamically allocates sample sizes according to each dataset's information quantity and selects spatially diverse samples for maximal representativeness. Second, to handle domain shifts, we design a prompt learning-based CL framework with a lightweight prompt module and a two-stage training strategy, enabling domain-specific feature adaptation while minimizing forgetting. Comprehensive experiments on large-scale public and self-collected datasets are conducted to validate the effectiveness of the proposed method. Compared with state-of-the-art (SOTA) methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly available at https://github.com/zouxianghong/LifelongPR.
English: The proposed LifelongPR framework addresses catastrophic forgetting in point cloud place recognition through dynamic replay sampling and prompt-based domain adaptation, achieving state-of-the-art performance improvements across multiple metrics.
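A loose sketch of the two replay-selection ingredients mentioned above: budgets allocated in proportion to each past dataset's information quantity, and spatially diverse samples picked greedily (farthest-point style). The information-quantity scores and the diversity criterion here are stand-ins, not the paper's exact procedure.

```python
import numpy as np

def allocate_budget(info, total):
    """Split a replay budget across past datasets in proportion to an
    information-quantity score (assumed to be given)."""
    info = np.asarray(info, dtype=float)
    return np.maximum(1, np.round(total * info / info.sum())).astype(int)

def spatially_diverse_subset(coords, k, seed=0):
    """Greedy farthest-point selection of k spatially diverse replay samples."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(coords)))]
    d = np.linalg.norm(coords - coords[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(coords - coords[nxt], axis=1))
    return chosen

print(allocate_budget([3.0, 1.0, 2.0], total=60))   # e.g. [30 10 20]
```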

Authors:Shubham Shukla, Kunal Sonalkar
Title: Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis
Abstract:
The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products depending on the business process. Quality attribution improves the customer experience as they navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the 'discovery experience' of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal) to evaluate these models in the attribution tasks of fashion products. Our study evaluates these models across 18 categories of fashion attributes, offering insight into where these models excel. We only use images as the sole input for product information to create a constrained environment. Our analysis shows that Gemini 2.0 Flash demonstrates the strongest overall performance with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution-related tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
English: This study evaluates the zero-shot performance of GPT-4o-mini and Gemini 2.0 Flash on fine-grained fashion attribute recognition using image inputs, finding Gemini 2.0 Flash superior with a 56.79% macro F1 score while highlighting the need for domain-specific optimization in e-commerce applications.
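For readers reproducing the evaluation protocol, a small sketch of how a macro F1 score could be aggregated across the 18 attribute categories using scikit-learn; the random labels and attribute names are placeholders, and the paper's exact aggregation may differ.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical per-attribute predictions: attribute name -> (y_true, y_pred)
rng = np.random.default_rng(0)
results = {f"attr_{i}": (rng.integers(0, 4, 100), rng.integers(0, 4, 100)) for i in range(18)}

per_attr = {name: f1_score(yt, yp, average="macro") for name, (yt, yp) in results.items()}
overall = float(np.mean(list(per_attr.values())))
print(f"macro F1 across {len(per_attr)} attribute categories: {overall:.4f}")
```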

Authors:Huilai Li, Yonghao Dang, Ying Xing, Yiming Wang, Jianqin Yin
Title: ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
Abstract:
Dense audio-visual event localization (DAVE) aims to identify event categories and locate the temporal boundaries in untrimmed videos. Most studies only employ event-related semantic constraints on the final outputs, lacking cross-modal semantic bridging in intermediate layers. This causes a modality semantic gap for further fusion, making it difficult to distinguish between event-related content and irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model's ability to infer concurrent events in complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our event-aware semantic guided network (ESG-Net) includes an early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applies multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixtures of experts with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released at https://github.com/uchiha99999/ESG-Net.
English: This paper introduces ESG-Net, a novel approach for dense audio-visual event localization that integrates multi-stage semantic guidance and multi-event relationship modeling to enhance hierarchical understanding and dependency extraction, achieving superior performance with reduced computational demands.

Authors:Zhanjiang Yang, Lijun Sun, Jiawei Dong, Xiaoxin An, Yang Liu, Meng Li
Title: MCGA: Mixture of Codebooks Hyperspectral Reconstruction via Grayscale-Aware Attention
Abstract:
Reconstructing hyperspectral images (HSIs) from RGB inputs provides a cost-effective alternative to hyperspectral cameras, but reconstructing high-dimensional spectra from three channels is inherently ill-posed. Existing methods typically directly regress RGB-to-HSI mappings using large attention networks, which are computationally expensive and handle ill-posedness only implicitly. We propose MCGA, a Mixture-of-Codebooks with Grayscale-aware Attention framework that explicitly addresses these challenges using spectral priors and photometric consistency. MCGA first learns transferable spectral priors via a mixture-of-codebooks (MoC) from heterogeneous HSI datasets, then aligns RGB features with these priors through grayscale-aware photometric attention (GANet). Efficiency and robustness are further improved via top-K attention design and test-time adaptation (TTA). Experiments on benchmarks and real-world data demonstrate state-of-the-art accuracy, strong cross-dataset generalization, and 4-5x faster inference. Code will be made available upon acceptance at https://github.com/Fibonaccirabbit/MCGA.
English Summary: The proposed MCGA framework efficiently reconstructs hyperspectral images from RGB inputs by leveraging spectral priors and photometric consistency, achieving state-of-the-art accuracy with faster inference and better generalization.

Authors:Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
Title: SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Abstract:
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/

Authors:Amirhossein Ansari, Ke Wang, Pulei Xiong
Title: NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
Abstract:
Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. The code is available at https://github.com/ah-ansari/NegRefine.
English: NegRefine is a novel framework that refines negative labels by filtering out subcategories and proper nouns and employs a dynamic scoring function to enhance zero-shot out-of-distribution detection, achieving robust performance on benchmarks like ImageNet-1K.

Authors:Abdul Manaf, Nimra Mughal
Title: AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs)
Abstract:
Pneumonia is a leading cause of mortality in children under five, requiring accurate chest X-ray diagnosis. This study presents a machine learning-based Pediatric Chest Pneumonia Classification System to assist healthcare professionals in diagnosing pneumonia from chest X-ray images. The CNN-based model was trained on 5,863 labeled chest X-ray images from children aged 0-5 years from the Guangzhou Women and Children's Medical Center. To address limited data, we applied augmentation techniques (rotation, zooming, shear, horizontal flipping) and employed GANs to generate synthetic images, addressing class imbalance. The system achieved optimal performance using combined original, augmented, and GAN-generated data, evaluated through accuracy and F1 score metrics. The final model was deployed via a Flask web application, enabling real-time classification with probability estimates. Results demonstrate the potential of deep learning and GANs in improving diagnostic accuracy and efficiency for pediatric pneumonia classification, particularly valuable in resource-limited clinical settings. Code is available at https://github.com/AbdulManaf12/Pediatric-Chest-Pneumonia-Classification
English: This study developed a machine learning system using CNN and GANs to classify pediatric pneumonia from chest X-rays, achieving high diagnostic accuracy and deployment via a web application for clinical use.
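A brief sketch of an augmentation pipeline matching the transforms listed in the abstract (rotation, zoom, shear, horizontal flip) using Keras' ImageDataGenerator; the parameter ranges, dataset path, and image size are assumptions, and the GAN-based synthesis would be a separate component.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings; exact ranges used by the authors are not specified
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    zoom_range=0.1,
    shear_range=0.1,
    horizontal_flip=True,
)
train_gen = augmenter.flow_from_directory(
    "chest_xray/train",          # hypothetical dataset path
    target_size=(224, 224),
    color_mode="grayscale",
    class_mode="binary",
    batch_size=32,
)
```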

Authors:Osher Rafaeli, Tal Svoray, Ariel Nahlieli
Title: Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model
Abstract:
High-resolution elevation estimations are essential to understand catchment and hillslope hydrology, study urban morphology and dynamics, and monitor the growth, decline, and mortality of terrestrial ecosystems. Various deep learning approaches (e.g., super-resolution techniques, monocular depth estimation) have been developed to create high-resolution Digital Elevation Models (DEMs). However, super-resolution techniques are limited by the upscaling factor, and monocular depth estimation lacks global elevation context, making its conversion to a seamless DEM restricted. The recently introduced technique of prompt-based monocular depth estimation has opened new opportunities to extract estimates of absolute elevation in a global context. We present here a framework for the estimation of high-resolution DEMs as a new paradigm for absolute global elevation mapping. It is exemplified using low-resolution Shuttle Radar Topography Mission (SRTM) elevation data as prompts and high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP). The approach fine-tunes a vision transformer encoder with LiDAR-derived DEMs and employs a versatile prompting strategy, enabling tasks such as DEM estimation, void filling, and updating. Our framework achieves a 100x resolution gain (from 30-m to 30-cm), surpassing prior methods by an order of magnitude. Evaluations across three diverse U.S. landscapes show robust generalization, capturing urban structures and fine-scale terrain features with < 5 m MAE relative to LiDAR, improving over SRTM by up to 18%. Hydrological analysis confirms suitability for hazard and environmental studies. We demonstrate scalability by applying the framework to large regions in the U.S. and Israel. All code and pretrained models are publicly available at: https://osherr1996.github.io/prompt2dem_propage/.

Authors:Xinyu Zhang, Zhonghao Ye, Jingwei Zhang, Xiang Tian, Zhisheng Liang, Shipeng Yu
Title: VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation
Abstract:
WiFi-based human pose estimation has emerged as a promising non-visual alternative due to its penetrability and privacy advantages. This paper presents VST-Pose, a novel deep learning framework for accurate and continuous pose estimation using WiFi channel state information. The proposed method introduces ViSTA-Former, a spatiotemporal attention backbone that adopts a dual-stream architecture to separately capture temporal dependencies and structural relationships among body joints. To enhance sensitivity to subtle human motions, a velocity modeling branch is integrated into the framework, which learns short-term keypoint displacement patterns and improves fine-grained motion representation. We construct a 2D pose dataset specifically designed for smart home care scenarios and demonstrate that our method achieves 92.2% accuracy on the PCK@50 metric, outperforming existing methods by 8.3% in PCK@50 on the self-collected dataset. Further evaluation on the public MMFi dataset confirms the model's robustness and effectiveness in 3D pose estimation tasks. The proposed system provides a reliable and privacy-aware solution for continuous human motion analysis in indoor environments. Our code is available at https://github.com/CarmenQing/VST-Pose.
English: VST-Pose is a novel deep learning framework using WiFi signals that achieves 92.2% pose estimation accuracy with enhanced motion sensitivity through velocity modeling, offering a privacy-preserving solution for indoor human motion analysis.
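Since the results above are reported with PCK@50, here is a generic sketch of the metric (the share of keypoints whose error falls within a fraction of a per-sample reference length); the normalization length actually used by the paper may differ.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """PCK@alpha: fraction of keypoints within alpha * ref_len of the ground truth.
    ref_len is a per-sample normalization length (e.g. a bounding-box diagonal)."""
    dist = np.linalg.norm(pred - gt, axis=-1)              # (N, K) joint-wise errors
    return float(np.mean(dist <= alpha * ref_len[:, None]))

pred = np.random.rand(8, 17, 2)
gt = pred + 0.01 * np.random.randn(8, 17, 2)
print(pck(pred, gt, ref_len=np.ones(8)))
```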

Authors:Zhengyuan Peng, Jianqing Xu, Shen Li, Jiazhen Ji, Yuge Huang, Jingyun Zhang, Jinmin Li, Shouhong Ding, Rizen Guo, Xin Tan, Lizhuang Ma
Title: EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR
Abstract:
Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation, which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels and empirically outperforms existing methods in downstream tasks, such as gaze estimation. EyeSeg outputs an uncertainty score and the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves to be effective especially under motion blur, eyelid occlusion and cross-domain challenges. Moreover, empirical results suggest that EyeSeg achieves segmentation improvements in MIoU, E1, F1, and ACC, surpassing previous approaches. The code is publicly available at https://github.com/JethroPeng/EyeSeg.
English Summary: EyeSeg is an uncertainty-aware eye segmentation framework for AR/VR that addresses motion blur, eyelid occlusion, and domain gaps by modeling uncertainties through Bayesian learning, achieving superior segmentation performance and gaze estimation robustness.
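The "weighting and fusing multiple gaze estimates" step could, for example, look like the inverse-uncertainty fusion sketched below; this is one plausible weighting rule, not the paper's exact mechanism, and the yaw/pitch values are toy numbers.

```python
import numpy as np

def fuse_gaze(estimates, uncertainties, eps=1e-6):
    """Fuse several gaze estimates, down-weighting those with high uncertainty."""
    est = np.asarray(estimates, dtype=float)               # (N, 2) yaw/pitch per estimate
    w = 1.0 / (np.asarray(uncertainties, dtype=float) + eps)
    w /= w.sum()
    return (w[:, None] * est).sum(axis=0)

print(fuse_gaze([[0.10, -0.20], [0.14, -0.18]], [0.05, 0.20]))
```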

Authors:Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu
Title: MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
Abstract:
Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR
English Summary: The MENTOR framework introduces an autoregressive approach with two-stage training to enhance multimodal image generation by achieving fine-grained alignment between inputs and outputs, demonstrating superior performance and efficiency over existing methods despite limited resources.

Authors:Bolun Zheng, Xinjie Liu, Qianyu Zhang, Canjin Wang, Fangni Chen, Mingen Xu
Title: EHPE: A Segmented Architecture for Enhanced Hand Pose Estimation
Abstract:
3D hand pose estimation has garnered great attention in recent years due to its critical applications in human-computer interaction, virtual reality, and related fields. The accurate estimation of hand joints is essential for high-quality hand pose estimation. However, existing methods neglect the importance of the Distal Phalanx Tip (TIP) and wrist in predicting hand joints overall and often fail to account for the phenomenon of error accumulation for distal joints in gesture estimation, which can cause certain joints to incur larger errors, resulting in misalignments and artifacts in the pose estimation and degrading the overall reconstruction quality. To address this challenge, we propose a novel segmented architecture for enhanced hand pose estimation (EHPE). We perform local extraction of the TIP and wrist, thus alleviating the effect of error accumulation on TIP prediction and further reducing the predictive errors for all joints on this basis. EHPE consists of two key stages: In the TIP and Wrist Joints Extraction stage (TW-stage), the positions of the TIP and wrist joints are estimated to provide an initial accurate joint configuration; In the Prior Guided Joints Estimation stage (PG-stage), a dual-branch interaction network is employed to refine the positions of the remaining joints. Extensive experiments on two widely used benchmarks demonstrate that EHPE achieves state-of-the-art performance. Code is available at https://github.com/SereinNout/EHPE.
中文摘要:本文提出的EHPE方法通过分段式架构分别提取指尖和手腕关节再优化其余关节,有效解决了三维手势估计中的误差累积问题,在多个基准测试中达到了最优性能。
English Summary: The proposed EHPE method addresses error accumulation in 3D hand pose estimation by introducing a segmented architecture that separately extracts distal phalanx tip and wrist joints before refining remaining joints, achieving state-of-the-art performance on benchmarks.

Authors:Ximeng Zhai, Bohan Xu, Yaohong Chen, Hao Wang, Kehua Guo, Yimian Dai
Title: SeqCSIST: Sequential Closely-Spaced Infrared Small Target Unmixing
Abstract:
Due to the limitation of the optical lens focal length and the resolution of the infrared detector, distant Closely-Spaced Infrared Small Target (CSIST) groups typically appear as mixing spots in the infrared image. In this paper, we propose a novel task, Sequential CSIST Unmixing, namely detecting all targets in the form of sub-pixel localization from a highly dense CSIST group. However, achieving such precise detection is an extremely difficult challenge. In addition, the lack of high-quality public datasets has also restricted the research progress. To this end, firstly, we contribute an open-source ecosystem, including SeqCSIST, a sequential benchmark dataset, and a toolkit that provides objective evaluation metrics for this special task, along with the implementation of 23 relevant methods. Furthermore, we propose the Deformable Refinement Network (DeRefNet), a model-driven deep learning framework that introduces a Temporal Deformable Feature Alignment (TDFA) module enabling adaptive inter-frame information aggregation. To the best of our knowledge, this work is the first endeavor to address the CSIST Unmixing task within a multi-frame paradigm. Experiments on the SeqCSIST dataset demonstrate that our method outperforms the state-of-the-art approaches with mean Average Precision (mAP) metric improved by 5.3%. Our dataset and toolkit are available from https://github.com/GrokCV/SeqCSIST.
Chinese: 本文提出了序列化密集红外小目标分离任务,通过亚像素定位检测目标,并开发了DeRefNet框架和新数据集,在mAP指标上比现有最佳方法提升了5.3%。
English: This paper introduces the Sequential CSIST Unmixing task for detecting closely-spaced infrared small targets through sub-pixel localization and presents the DeRefNet framework with a novel dataset, achieving a 5.3% mAP improvement over existing methods.

Authors:Zihao Xiong, Fei Zhou, Fengyi Wu, Shuai Yuan, Maixia Fu, Zhenming Peng, Jian Yang, Yimian Dai
Title: DRPCA-Net: Make Robust PCA Great Again for Infrared Small Target Detection
Abstract:
Infrared small target detection plays a vital role in remote sensing, industrial monitoring, and various civilian applications. Despite recent progress powered by deep learning, many end-to-end convolutional models tend to pursue performance by stacking increasingly complex architectures, often at the expense of interpretability, parameter efficiency, and generalization. These models typically overlook the intrinsic sparsity prior of infrared small targets--an essential cue that can be explicitly modeled for both performance and efficiency gains. To address this, we revisit the model-based paradigm of Robust Principal Component Analysis (RPCA) and propose Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture. Unlike conventional deep unfolding methods that rely on static, globally learned parameters, DRPCA-Net introduces a dynamic unfolding mechanism via a lightweight hypernetwork. This design enables the model to adaptively generate iteration-wise parameters conditioned on the input scene, thereby enhancing its robustness and generalization across diverse backgrounds. Furthermore, we design a Dynamic Residual Group (DRG) module to better capture contextual variations within the background, leading to more accurate low-rank estimation and improved separation of small targets. Extensive experiments on multiple public infrared datasets demonstrate that DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Code is available at https://github.com/GrokCV/DRPCA-Net.
中文: 本文提出的动态RPCA网络(DRPCA-Net)通过深度展开架构融合稀疏性先验知识,并采用自适应参数生成机制,在多种复杂背景下实现了显著优于现有方法的红外小目标检测精度。
English: The proposed Dynamic RPCA Network (DRPCA-Net) integrates sparsity-aware priors through a deep unfolding architecture with adaptive parameter generation, achieving superior infrared small target detection accuracy across diverse backgrounds.
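As a rough illustration of the deep-unfolding idea behind DRPCA-Net, the sketch below shows one generic unfolded RPCA iteration in PyTorch. The decomposition D ≈ L + S, the soft-thresholding proximal step, and the placeholder `low_rank_net` and `tau` are standard RPCA-unfolding ingredients and assumptions on my part, not the paper's exact modules.

```python
import torch

def soft_threshold(x: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Element-wise soft-thresholding, the proximal operator of the L1 sparsity prior."""
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

def rpca_unfolded_step(D, L, S, tau, low_rank_net):
    """One illustrative unfolded RPCA iteration for D ≈ L (low-rank background) + S (sparse targets).

    In a dynamic unfolding design, per-iteration parameters such as tau could be produced by a
    lightweight hypernetwork conditioned on the input scene; low_rank_net stands in for the
    learned background-estimation module.
    """
    S = soft_threshold(D - L, tau)   # sparse target update under the sparsity prior
    L = low_rank_net(D - S)          # learned low-rank background update
    return L, S
```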

Authors:Yunwei Lan, Zhigao Cui, Xin Luo, Chang Liu, Nian Wang, Menglin Zhang, Yanzhao Su, Dong Liu
Title: When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training
Abstract:
Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator's limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality results. To ensure the consistency of structural information and details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method's superiority. Code: https://github.com/ywxjm/DehazeSB.
Chinese: DehazeSB提出了一种基于薛定谔桥的新型非配对去雾框架,利用最优传输理论直接连接有雾与清晰图像的分布,结合细节保留正则化和提示学习,在多个真实数据集上展现出卓越性能。
English: DehazeSB introduces a novel unpaired dehazing framework using the Schrödinger Bridge and optimal transport theory to efficiently map hazy to clear images with enhanced detail preservation and haze-aware prompt learning, achieving superior results on real-world datasets.

Authors:Yiwen Liang, Hui Chen, Yizhe Xiong, Zihan Zhou, Mengyao Lyu, Zijia Lin, Shuaicheng Niu, Sicheng Zhao, Jungong Han, Guiguang Ding
Title: Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations
Abstract:
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs' performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under real-world distribution shifts. Code: https://github.com/Evelyn1ywliang/ReTA.
中文: 针对分布偏移下视觉语言模型因熵值样本选择不可靠和决策边界僵化导致的性能下降问题,提出的ReTA方法通过一致性感知熵加权提升缓存质量,并利用多样性驱动的分布校准实现自适应预测,显著提升模型鲁棒性。
English: Vision-language models face reliability issues under distribution shifts due to unreliable entropy-based sample selection and rigid decision boundaries, which the proposed ReTA method addresses through consistency-aware entropy reweighting for cache quality and diversity-driven distribution calibration for adaptive predictions.

Authors:Yuanhong Zheng, Ruixuan Yu, Jian Sun
Title: Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
Abstract:
3D multi-person motion prediction is a highly complex task, primarily due to the dependencies on both individual past movements and the interactions between agents. Moreover, effectively modeling these interactions often incurs substantial computational costs. In this work, we propose a computationally efficient model for multi-person motion prediction by simplifying spatial and temporal interactions. Our approach begins with the design of lightweight dual branches that learn local and global representations for individual and multiple persons separately. Additionally, we introduce a novel cross-level interaction block to integrate the spatial and temporal representations from both branches. To further enhance interaction modeling, we explicitly incorporate the spatial inter-person distance embedding. With above efficient temporal and spatial design, we achieve state-of-the-art performance for multiple metrics on standard datasets of CMU-Mocap, MuPoTS-3D, and 3DPW, while significantly reducing the computational cost. Code is available at https://github.com/Yuanhong-Zheng/EMPMP.
中文: 本文提出了一种计算高效的3D多人运动预测模型,通过轻量级双分支结构和新型跨层级交互模块简化时空交互,在标准数据集上实现最优性能的同时显著降低了计算成本。
English: This paper introduces a computationally efficient model for 3D multi-person motion prediction by simplifying spatial and temporal interactions through lightweight dual branches and a novel cross-level interaction block, achieving state-of-the-art performance on standard datasets while significantly reducing computational costs.

Authors:Ankit Sanjyal
Title: RectifiedHR: High-Resolution Diffusion via Energy Profiling and Adaptive Guidance Scheduling
Abstract:
High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior.
中文摘要:本研究通过提出自适应无分类器引导调度策略,动态调整引导强度以解决扩散模型图像合成中的能量不稳定和伪影问题,相比固定引导方法获得了更优的稳定性和图像质量。
English Summary: This study addresses energy instabilities and artifacts in diffusion-based image synthesis by introducing adaptive classifier-free guidance schedules that dynamically adjust guidance strength, achieving superior stability and image quality compared to fixed approaches.
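A minimal sketch of the linearly decreasing classifier-free guidance schedule described in the abstract above; the function names, default scales, and the commented sampling loop are illustrative assumptions rather than the paper's implementation.

```python
import torch

def cfg_scale_linear(step: int, num_steps: int, w_start: float = 7.5, w_end: float = 1.0) -> float:
    """Linearly decay the CFG scale from w_start to w_end over the sampling trajectory."""
    t = step / max(num_steps - 1, 1)
    return w_start + t * (w_end - w_start)

def guided_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float) -> torch.Tensor:
    """Standard classifier-free guidance combination with a time-varying scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Illustrative use inside a sampling loop (model and scheduler are placeholders):
# for i in range(num_steps):
#     w = cfg_scale_linear(i, num_steps)
#     eps = guided_noise(model(x, t, cond=None), model(x, t, cond=prompt), w)
#     x = scheduler.step(eps, t, x)
```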

Authors:Timothy Chase, Karthik Dantu
Title: Domain Adaptation and Multi-view Attention for Learnable Landmark Tracking with Sparse Data
Abstract:
The detection and tracking of celestial surface terrain features are crucial for autonomous spaceflight applications, including Terrain Relative Navigation (TRN), Entry, Descent, and Landing (EDL), hazard analysis, and scientific data collection. Traditional photoclinometry-based pipelines often rely on extensive a priori imaging and offline processing, constrained by the computational limitations of radiation-hardened systems. While historically effective, these approaches typically increase mission costs and duration, operate at low processing rates, and have limited generalization. Recently, learning-based computer vision has gained popularity to enhance spacecraft autonomy and overcome these limitations. While promising, emerging techniques frequently impose computational demands exceeding the capabilities of typical spacecraft hardware for real-time operation and are further challenged by the scarcity of labeled training data for diverse extraterrestrial environments. In this work, we present novel formulations for in-situ landmark tracking via detection and description. We utilize lightweight, computationally efficient neural network architectures designed for real-time execution on current-generation spacecraft flight processors. For landmark detection, we propose improved domain adaptation methods that enable the identification of celestial terrain features with distinct, cheaply acquired training data. Concurrently, for landmark description, we introduce a novel attention alignment formulation that learns robust feature representations that maintain correspondence despite significant landmark viewpoint variations. Together, these contributions form a unified system for landmark tracking that demonstrates superior performance compared to existing state-of-the-art techniques.

Authors:Sourish Suri, Yifei Shao
Title: Automated Multi-Class Crop Pathology Classification via Convolutional Neural Networks: A Deep Learning Approach for Real-Time Precision Agriculture
Abstract:
Crop diseases present a significant barrier to agricultural productivity and global food security, especially in large-scale farming where early identification is often delayed or inaccurate. This research introduces a Convolutional Neural Network (CNN)-based image classification system designed to automate the detection and classification of eight common crop diseases using leaf imagery. The methodology involves a complete deep learning pipeline: image acquisition from a large, labeled dataset, preprocessing via resizing, normalization, and augmentation, and model training using TensorFlow with Keras' Sequential API. The CNN architecture comprises three convolutional layers with increasing filter sizes and ReLU activations, followed by max pooling, flattening, and fully connected layers, concluding with a softmax output for multi-class classification. The system achieves high training accuracy (~90%) and demonstrates reliable performance on unseen data, although a validation accuracy of ~60% suggests minor overfitting. Notably, the model integrates a treatment recommendation module, providing actionable guidance by mapping each detected disease to suitable pesticide or fungicide interventions. Furthermore, the solution is deployed on an open-source, mobile-compatible platform, enabling real-time image-based diagnostics for farmers in remote areas. This research contributes a scalable and accessible tool to the field of precision agriculture, reducing reliance on manual inspection and promoting sustainable disease management practices. By merging deep learning with practical agronomic support, this work underscores the potential of CNNs to transform crop health monitoring and enhance food production resilience on a global scale.
中文摘要:本研究开发了一种基于卷积神经网络的图像分类系统,能够自动识别叶片图像中的八种常见作物病害并提供治理建议,通过高训练精度和移动端部署为精准农业提供支持。
English Summary: This research develops a CNN-based image classification system that automatically detects eight common crop diseases from leaf images and provides treatment recommendations, achieving high training accuracy and mobile deployment to support farmers in precision agriculture.
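The abstract's pipeline maps naturally onto a small Keras Sequential model; the sketch below follows that description (three convolutional blocks with increasing filters and ReLU, max pooling, flattening, dense layers, and a softmax over eight classes), with the filter counts, input size, and optimizer chosen as assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crop_disease_cnn(input_shape=(128, 128, 3), num_classes=8):
    """Sketch of the described architecture; hyperparameters are assumptions, not from the paper."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Rescaling(1.0 / 255),                # normalization
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```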

Authors:Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman
Title: Simplifying Traffic Anomaly Detection with Video Foundation Models
Abstract:
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.
Chinese: 最新研究表明,采用视频视觉变换器的简单编码器模型,在结合自监督掩码视频建模和领域自适应预训练等先进技术后,能在以自我为中心的交通异常检测中超越复杂专用方法,同时显著提升效率。
English: Recent research demonstrates that simple encoder-only models using Video Vision Transformers, when enhanced with advanced pre-training like self-supervised Masked Video Modeling and domain-adaptive techniques, can outperform complex specialized methods in ego-centric Traffic Anomaly Detection while being more efficient.

Authors:Wencan Huang, Daizong Liu, Wei Hu
Title: Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding
Abstract:
While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. Both of these two techniques operate without modifying the parameters of the target model. Extensive evaluations across five benchmarks validate the effectiveness of Fast3D, particularly under high visual token pruning ratios. Code is available at https://github.com/wencan25/Fast3D
中文:Fast3D框架通过全局注意力预测和自适应剪枝技术,在不改变模型参数的情况下有效减少3D多模态大语言模型中的冗余视觉标记,显著提升了计算效率并在高剪枝率下保持优异性能。
English: The Fast3D framework addresses computational inefficiency in 3D MLLMs by introducing global attention prediction and sample-adaptive pruning to reduce redundant visual tokens without altering model parameters, achieving strong performance under high pruning ratios.
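A compact sketch of attention-guided visual token pruning in the spirit of Fast3D's GAP/SAP modules: a predicted per-token importance score selects which object-centric tokens to keep. The tensor shapes, the fixed `keep_ratio`, and the top-k rule are simplifying assumptions (a sample-adaptive policy would vary the budget per input).

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor, pred_attention: torch.Tensor, keep_ratio: float = 0.3):
    """Keep only the visual tokens with the highest predicted global attention.

    tokens:         (B, N, D) object-centric 3D visual tokens
    pred_attention: (B, N) importance scores from a lightweight predictor
    keep_ratio:     fraction of tokens to retain
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    topk = pred_attention.topk(k, dim=1).indices            # (B, k) indices of kept tokens
    idx = topk.unsqueeze(-1).expand(-1, -1, D)              # (B, k, D) gather indices
    return torch.gather(tokens, 1, idx), topk
```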

Authors:Yueqian Wang, Xiaojun Meng, Yifan Wang, Huishuai Zhang, Dongyan Zhao
Title: ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
Abstract:
With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system's ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveVideoQA and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveVideoQA
中文摘要:提出了ProactiveVideoQA基准和PAUC评估指标,用以衡量多模态对话系统的主动交互能力,其中PAUC通过考量响应时间动态,被证实更符合人类偏好。
English Summary: The ProactiveVideoQA benchmark and PAUC metric are introduced to evaluate multimodal dialogue systems' proactive interaction capabilities, with PAUC proving more aligned with human preferences by accounting for response timing dynamics.

Authors:Zile Wang, Hao Yu, Jiabo Zhan, Chun Yuan
Title: AlphaVAE: Unified End-to-End RGBA Image Reconstruction and Generation with Alpha-Aware Representation Learning
Abstract:
Recent advances in latent diffusion models have achieved remarkable results in high-fidelity RGB image synthesis by leveraging pretrained VAEs to compress and reconstruct pixel data at low computational cost. However, the generation of transparent or layered content (RGBA image) remains largely unexplored, due to the lack of large-scale benchmarks. In this work, we propose ALPHA, the first comprehensive RGBA benchmark that adapts standard RGB metrics to four-channel images via alpha blending over canonical backgrounds. We further introduce ALPHAVAE, a unified end-to-end RGBA VAE that extends a pretrained RGB VAE by incorporating a dedicated alpha channel. The model is trained with a composite objective that combines alpha-blended pixel reconstruction, patch-level fidelity, perceptual consistency, and dual KL divergence constraints to ensure latent fidelity across both RGB and alpha representations. Our RGBA VAE, trained on only 8K images in contrast to 1M used by prior methods, achieves a +4.9 dB improvement in PSNR and a +3.2% increase in SSIM over LayerDiffuse in reconstruction. It also enables superior transparent image generation when fine-tuned within a latent diffusion framework. Our code, data, and models are released on https://github.com/o0o0o00o0/AlphaVAE for reproducibility.
中文: 本文提出了首个全面的RGBA基准ALPHA和统一的RGBA变分自编码器ALPHAVAE,该模型仅用极少量训练数据就在透明图像重建和生成方面显著超越了现有方法。
English: This paper introduces ALPHA, the first comprehensive RGBA benchmark, and ALPHAVAE, a unified RGBA variational autoencoder that significantly outperforms existing methods in transparent image reconstruction and generation despite using minimal training data.
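The benchmark's adaptation of RGB metrics to four-channel images amounts to alpha compositing over canonical backgrounds before scoring; a minimal sketch is below, with the PSNR helper and value ranges as assumptions.

```python
import torch

def alpha_blend(rgba: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """Composite an RGBA image over a canonical background so standard RGB metrics apply.

    rgba:       (B, 4, H, W) with RGB and alpha in [0, 1]
    background: (B, 3, H, W) canonical background (e.g. all-white or all-black)
    """
    rgb, alpha = rgba[:, :3], rgba[:, 3:4]
    return alpha * rgb + (1.0 - alpha) * background

def rgba_psnr(pred: torch.Tensor, target: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """PSNR computed on alpha-blended composites, as an illustrative adapted metric."""
    mse = ((alpha_blend(pred, background) - alpha_blend(target, background)) ** 2).mean()
    return 10.0 * torch.log10(1.0 / (mse + 1e-12))
```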

Authors:Zhiwei Xu
Title: DAA*: Deep Angular A Star for Image-based Path Planning
Abstract:
Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A* (DAA*), by incorporating the proposed path angular freedom (PAF) into A* to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA* improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA* over neural A* in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by 9.0% SPR, 6.9% ASIM, and 3.9% PSIM. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA* significantly outperforms the state-of-the-art TransPath by 6.3% SPR, 6.0% PSIM, and 3.7% ASIM. We also discuss the minor trade-off between path optimality and search efficiency where applicable. Our code and model weights are available at https://github.com/zwxu064/DAAStar.git.
中文: 本文提出深度角度A*(DAA*)方法,通过引入路径角度自由度来优化路径平滑度与专家路径的相似性,在多个数据集上显著提升了路径相似性指标。
English: This paper introduces Deep Angular A* (DAA*), a novel method that enhances path imitation learning by incorporating path angular freedom to optimize smoothness and similarity to expert paths, achieving significant improvements in path metrics across multiple datasets.
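To make the shortening-versus-smoothing trade-off concrete, here is a hypothetical node-expansion cost that adds a turn-angle penalty to the step length; `angle_weight` and the exact form are placeholders and not the paper's PAF formulation.

```python
import math

def move_cost(prev_dir: tuple, new_dir: tuple,
              step_length: float, angle_weight: float = 0.5) -> float:
    """Illustrative expansion cost mixing path shortening (step length) with path smoothing
    (turn-angle penalty between the previous and new move directions)."""
    dot = prev_dir[0] * new_dir[0] + prev_dir[1] * new_dir[1]
    norm = math.hypot(*prev_dir) * math.hypot(*new_dir)
    angle = math.acos(max(-1.0, min(1.0, dot / norm))) if norm > 0 else 0.0
    return step_length + angle_weight * angle
```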

Authors:Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer
Title: ViT-ProtoNet for Few-Shot Image Classification: A Multi-Benchmark Evaluation
Abstract:
The remarkable representational power of Vision Transformers (ViTs) remains underutilized in few-shot image classification. In this work, we introduce ViT-ProtoNet, which integrates a ViT-Small backbone into the Prototypical Network framework. By averaging class conditional token embeddings from a handful of support examples, ViT-ProtoNet constructs robust prototypes that generalize to novel categories under 5-shot settings. We conduct an extensive empirical evaluation on four standard benchmarks: Mini-ImageNet, FC100, CUB-200, and CIFAR-FS, including overlapped support variants to assess robustness. Across all splits, ViT-ProtoNet consistently outperforms CNN-based prototypical counterparts, achieving up to a 3.2% improvement in 5-shot accuracy and demonstrating superior feature separability in latent space. Furthermore, it outperforms or is competitive with transformer-based competitors using a more lightweight backbone. Comprehensive ablations examine the impact of transformer depth, patch size, and fine-tuning strategy. To foster reproducibility, we release code and pretrained weights. Our results establish ViT-ProtoNet as a powerful, flexible approach for few-shot classification and set a new baseline for transformer-based meta-learners.
中文: ViT-ProtoNet通过将视觉Transformer骨干网络融入原型网络,在小样本图像分类中显著提升了准确率和特征可分性,在多个基准测试中以轻量级架构确立了新基线。
English: ViT-ProtoNet enhances few-shot image classification by integrating a Vision Transformer backbone into Prototypical Networks, achieving superior accuracy and feature separability across multiple benchmarks with a lightweight architecture.
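The prototype construction and nearest-prototype classification described above follow the standard Prototypical Network recipe; a minimal PyTorch sketch (assuming ViT embeddings are already extracted and every class appears in the support set) is shown below.

```python
import torch

def classify_with_prototypes(support_emb, support_labels, query_emb, num_classes):
    """Prototypical-network classification: average support embeddings per class, then
    assign queries to the nearest prototype (negative squared Euclidean distance as logits).

    support_emb:    (N_support, D) embeddings, e.g. from a ViT-Small backbone
    support_labels: (N_support,) integer class ids in [0, num_classes)
    query_emb:      (N_query, D)
    """
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                   # (C, D)
    logits = -torch.cdist(query_emb, prototypes) ** 2    # (N_query, C)
    return logits.argmax(dim=1), logits
```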

Authors:Yuval Grader, Hadar Averbuch-Elor
Title: Supercharging Floorplan Localization with Semantic Rays
Abstract:
Floorplans provide a compact representation of the building's structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we show that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency.

Authors:Chenhao Ding, Jiangtao Zhang, Zongsheng Yue, Hui Wang, Qian Zhao, Deyu Meng
Title: Generative Latent Kernel Modeling for Blind Motion Deblurring
Abstract:
Deep prior-based approaches have demonstrated remarkable success in blind motion deblurring (BMD) recently. These methods, however, are often limited by the high non-convexity of the underlying optimization process in BMD, which leads to extreme sensitivity to the initial blur kernel. To address this issue, we propose a novel framework for BMD that leverages a deep generative model to encode the kernel prior and induce a better initialization for the blur kernel. Specifically, we pre-train a kernel generator based on a generative adversarial network (GAN) to aptly characterize the kernel's prior distribution, as well as a kernel initializer to provide a well-informed and high-quality starting point for kernel estimation. By combining these two components, we constrain the BMD solution within a compact latent kernel manifold, thus alleviating the aforementioned sensitivity for kernel initialization. Notably, the kernel generator and initializer are designed to be easily integrated with existing BMD methods in a plug-and-play manner, enhancing their overall performance. Furthermore, we extend our approach to tackle blind non-uniform motion deblurring without the need for additional priors, achieving state-of-the-art performance on challenging benchmark datasets. The source code is available at https://github.com/dch0319/GLKM-Deblur.
中文摘要:本文提出了一种新颖的盲运动去模糊框架,通过基于GAN的核生成器和初始化器提供更优的核初始化,有效降低对初始条件的敏感性,并在多个基准数据集上实现了最先进的性能。
English Summary: This paper introduces a novel blind motion deblurring framework that uses a GAN-based kernel generator and initializer to provide better kernel initialization, reducing sensitivity to initial conditions while achieving state-of-the-art performance.

Authors:Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel
Title: Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
Abstract:
Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
Chinese: Prompt4Trust是一种强化学习框架,旨在提升多模态大语言模型的置信度校准与任务准确性,尤其关注临床决策的可信度,并在医学视觉问答基准中实现了领先性能。
English: Prompt4Trust is a reinforcement learning framework that enhances multimodal large language models' confidence calibration and task accuracy, particularly for trustworthy clinical decision-making, achieving state-of-the-art performance in medical visual question answering.

Authors:Shuhan Ye, Yuanbin Qian, Chong Wang, Sunqi Lin, Jiazhen Xu, Jiangbo Qian, Yuqi Li
Title: Cross Knowledge Distillation between Artificial and Spiking Neural Networks
Abstract:
Recently, Spiking Neural Networks (SNNs) have demonstrated rich potential in the computer vision domain due to their high biological plausibility, event-driven characteristics and energy-saving efficiency. Still, limited annotated event-based datasets and immature SNN architectures result in performance inferior to that of Artificial Neural Networks (ANNs). To enhance the performance of SNNs on their optimal data format, DVS data, we explore using RGB data and well-performing ANNs to implement knowledge distillation. In this case, solving cross-modality and cross-architecture challenges is necessary. In this paper, we propose cross knowledge distillation (CKD), which not only leverages semantic similarity and sliding replacement to mitigate the cross-modality challenge, but also uses an indirect phased knowledge distillation to mitigate the cross-architecture challenge. We validated our method on mainstream neuromorphic datasets, including N-Caltech101 and CEP-DVS. The experimental results show that our method outperforms current State-of-the-Art methods. The code will be available at https://github.com/ShawnYE618/CKD
Chinese: 脉冲神经网络在计算机视觉领域潜力巨大,但因标注数据有限和架构不成熟而性能不及人工神经网络,为此提出跨知识蒸馏方法,通过语义相似性和间接分阶段蒸馏解决跨模态和跨架构问题,在主流神经形态数据集上表现优于现有最优方法。
English: Spiking Neural Networks (SNNs) show promise in computer vision but lag behind Artificial Neural Networks (ANNs) due to limited datasets and immature architectures, leading to the proposal of cross knowledge distillation (CKD) that addresses cross-modality and cross-architecture challenges and outperforms state-of-the-art methods on neuromorphic datasets.

Authors:Junyu Chen, Yihua Gao, Mingyuan Ge, Mingyong Li
Title: Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching
Abstract:
Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences between soft positive samples (semantically similar but incorrectly labeled) and soft negative samples (locally matched but globally inconsistent), creating matching uncertainties. Furthermore, current methods fail to fully utilize the neighborhood relationships among semantically similar instances within training batches, limiting the model's ability to learn high-order shared knowledge. This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. AAHR constructs a unified representation space through dynamic clustering prototype contrastive learning, effectively mitigating the soft positive sample problem. The framework introduces global and local feature extraction mechanisms and an adaptive aggregation network, significantly enhancing full-grained semantic understanding capabilities. Additionally, AAHR employs intra-modal and inter-modal correlation matrices to investigate neighborhood relationships among sample instances thoroughly. It incorporates a GNN to enhance semantic interactions between instances. Furthermore, AAHR integrates momentum contrastive learning to expand the negative sample set. These combined strategies significantly improve the model's ability to discriminate between features. Experimental results demonstrate that AAHR outperforms existing state-of-the-art methods on Flickr30K, MSCOCO, and ECCV Caption datasets, considerably improving the accuracy and efficiency of image-text matching. The code and model checkpoints for this research are available at https://github.com/Image-Text-Matching/AAHR.
中文摘要:AAHR框架通过动态聚类原型对比学习解决语义模糊问题,并采用多模态关联分析与图神经网络增强特征判别能力,在多个基准数据集上实现了最优的图像-文本匹配性能。
English Summary: The AAHR framework addresses image-text matching challenges by mitigating semantic ambiguities through dynamic clustering prototype contrastive learning and enhancing feature discrimination using multi-modal correlation analysis with GNN integration, achieving state-of-the-art performance on benchmark datasets.

Authors:Shiyi Mu, Zichong Gu, Hanqi Lyu, Yilin Gao, Shugong Xu
Title: Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline
Abstract:
3D detection technology is widely used in the field of autonomous driving, with its application scenarios gradually expanding from enclosed highways to open conventional roads. For rare anomaly categories that appear on the road, 3D detection models trained on closed sets often misdetect or fail to detect anomaly objects. To address this risk, it is necessary to enhance the generalization ability of 3D detection models for targets of arbitrary shapes and to possess the capability to filter out anomalies. The generalization of 3D detection is limited by two factors: the coupled training of 2D and 3D, and the insufficient diversity in the scale distribution of training samples. This paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm, which decouples the training strategy of 3D and 2D to release the generalization ability for arbitrary 3D foreground detection, and proposes an anomaly scoring algorithm based on foreground confidence prediction, achieving target-level anomaly scoring. In order to further verify and enhance the generalization of anomaly detection, we use a 3D rendering method to synthesize two augmented reality binocular stereo 3D detection datasets named KITTI-AR. KITTI-AR extends KITTI by adding 97 new categories, totaling 6k pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories as extra training data to address the sparse sample distribution issue. Additionally, 58 rare categories form the KITTI-AR-OoD subset, which are not used in training to simulate zero-shot scenarios in real-world settings, solely for evaluating 3D anomaly detection. Finally, the performance of the algorithm and the dataset is verified in the experiments. (Code and dataset can be obtained at https://github.com/shiyi-mu/S3AD-Code).
Chinese: 本文提出S3AD算法,通过解耦2D和3D训练并采用新的评分方法,增强了自动驾驶中的3D异常检测能力,并在合成的KITTI-AR数据集上进行了验证。
English: This paper introduces the S3AD algorithm, which enhances 3D anomaly detection in autonomous driving by decoupling 2D and 3D training and using a novel scoring method, validated on the synthesized KITTI-AR dataset.

Authors:Jonas Scholz, Richard E. Turner
Title: Warm Starts Accelerate Generative Modelling
Abstract:
Iterative generative models, like diffusion and flow-matching, create high-fidelity samples by progressively refining a noise vector into data. However, this process is notoriously slow, often requiring hundreds of function evaluations. We introduce the warm-start model, a simple, deterministic model that dramatically accelerates conditional generation by providing a better starting point. Instead of starting generation from an uninformed N(0, I) prior, our warm-start model predicts an informed prior N(mu, sigma), whose moments are conditioned on the input context. This "warm start" substantially reduces the distance the generative process must traverse, particularly when the conditioning information is strongly informative. On tasks like image inpainting, our method achieves results competitive with a 1000-step DDPM baseline using only 11 total function evaluations (1 for the warm start, 10 for generation). A simple conditional normalization trick makes our method compatible with any standard generative model and sampler without modification, allowing it to be combined with other efficient sampling techniques for further acceleration. Our implementation is available at https://github.com/jonas-scholz123/warm-start-model.
Chinese: 预热启动模型通过提供基于输入条件的有信息先验,显著减少了生成过程所需的函数评估次数,仅用11次评估即可达到与1000步基准相媲美的效果。
English: The warm-start model accelerates conditional generation by providing an informed prior that reduces the number of function evaluations needed, achieving competitive results with only 11 evaluations compared to 1000-step baselines.
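A minimal sketch of the warm-start idea from the abstract above: a deterministic network predicts the moments of an informed prior from the conditioning context, and sampling starts from that prior instead of N(0, I). The convolutional architecture and channel counts are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class WarmStartPrior(nn.Module):
    """Deterministic network predicting per-pixel (mu, sigma) from the conditioning context,
    so generation can start from N(mu, sigma) instead of N(0, I). Architecture is illustrative."""
    def __init__(self, in_channels: int = 3, out_channels: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * out_channels, 3, padding=1),
        )

    def forward(self, context: torch.Tensor):
        mu, log_sigma = self.net(context).chunk(2, dim=1)
        return mu, log_sigma.exp()

def warm_start_sample(prior: WarmStartPrior, context: torch.Tensor) -> torch.Tensor:
    """Draw the initial latent from the informed prior rather than standard Gaussian noise."""
    mu, sigma = prior(context)
    return mu + sigma * torch.randn_like(mu)
```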

Authors:Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang
Title: MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models
Abstract:
Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. This paper reveals the negative impact of the long-term decay in Rotary Position Encoding (RoPE), used for positional modeling in LVLMs, on multimodal alignment. Concretely, under long-term decay, instruction tokens exhibit uneven perception of image tokens located at different positions within the two-dimensional space: prioritizing image tokens from the bottom-right region since in the one-dimensional sequence, these tokens are positionally closer to the instruction tokens. This biased perception leads to insufficient image-instruction interaction and suboptimal multimodal alignment. We refer to this phenomenon as image alignment bias. To enhance instruction's perception of image tokens at different spatial locations, we propose MCA-LLaVA, based on Manhattan distance, which extends the long-term decay to a two-dimensional, multi-directional spatial decay. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling, mitigating hallucinations by alleviating image alignment bias. Experimental results of MCA-LLaVA across various hallucination and general benchmarks demonstrate its effectiveness and generality. The code can be accessed in https://github.com/ErikZ719/MCA-LLaVA.
中文: 本文发现旋转位置编码中的长期衰减会导致大型视觉语言模型产生感知偏差从而引发幻觉,并提出MCA-LLaVA模型通过将空间衰减扩展到二维来改善多模态对齐。
English: This paper identifies that the long-term decay in Rotary Position Encoding causes biased perception in Large Vision Language Models, leading to hallucinations, and proposes MCA-LLaVA to mitigate this by extending spatial decay to two dimensions for better multimodal alignment.
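The core quantity behind the proposed two-dimensional decay is the Manhattan distance between image-token grid positions; the helper below only builds that distance matrix and is an illustrative assumption, not the paper's attention implementation.

```python
import torch

def manhattan_relative_distance(grid_h: int, grid_w: int) -> torch.Tensor:
    """2D Manhattan distances between all pairs of image-token grid positions.

    The abstract's idea is to drive positional decay by this two-dimensional distance
    rather than by 1D sequence offsets; returns an (N, N) matrix with N = grid_h * grid_w.
    """
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()    # (N, 2) grid coordinates
    return (pos[:, None, :] - pos[None, :, :]).abs().sum(dim=-1)      # (N, N) Manhattan distances
```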

Authors:Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li
Title: Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning
Abstract:
Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our implementation is available at https://github.com/Jywsuperman/LGSP.
中文摘要:本文提出LGSP-Prompt方法,通过将提示学习从词元维度转向空间维度,结合局部空间特征和全局频域表示生成动态空间提示,有效解决了小样本类增量学习中的性能下降问题,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces LGSP-Prompt, a novel method that addresses performance degradation in Few-Shot Class-Incremental Learning by shifting prompt learning from token to spatial dimensions, achieving state-of-the-art results through dynamic spatial prompts combining local features and global frequency representations.

Authors:Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Title: $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Abstract:
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II-World.
中文: I²-World提出了一种高效的4D占据预测框架,通过解耦场景标记化为场景内与场景间组件,以25.1%的mIoU提升和37.0 FPS实时推理速度实现了最先进性能。
English: I²-World introduces an efficient 4D occupancy forecasting framework that decouples scene tokenization into intra-scene and inter-scene components, achieving state-of-the-art performance with 25.1% higher mIoU and 37.0 FPS real-time inference.

Authors:Dewen Zhang, Tahir Hussain, Wangpeng An, Hayaru Shouno
Title: PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment
Abstract:
Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
中文: PoseLLM采用非线性MLP视觉语言连接器替代LocLLM的线性投影器,通过增强跨模态特征融合,在提升姿态估计精度的同时保持了优秀的零样本泛化能力。
English: PoseLLM introduces a nonlinear MLP vision-language connector to replace the linear projector in LocLLM, enhancing cross-modal feature fusion for improved pose estimation accuracy and zero-shot generalization.
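The connector change is small enough to sketch directly: a two-layer MLP with GELU replacing a linear projector between vision features and the LLM embedding space. The dimensions below are placeholders, not the paper's configuration.

```python
import torch.nn as nn

class MLPConnector(nn.Module):
    """Nonlinear vision-language connector as described in the abstract: a lightweight
    two-layer MLP with GELU in place of a single linear projection."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):    # (B, N, vision_dim) -> (B, N, llm_dim)
        return self.proj(visual_tokens)
```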

Authors:Chuan Guo, Inwoo Hwang, Jian Wang, Bing Zhou
Title: SnapMoGen: Human Motion Generation from Expressive Texts
Abstract:
Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: https://snap-research.github.io/SnapMoGen/

Authors:Linlan Huang, Xusheng Cao, Haori Lu, Yifan Meng, Fei Yang, Xialei Liu
Title: Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning
Abstract:
Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data. Our code is available at https://github.com/linlany/MindtheGap.
中文: 本文提出MG-CLIP方法,通过利用CLIP中的模态差距来保留预训练知识并适应新数据,从而在无需额外回放数据的情况下提升持续学习性能。
English: This paper introduces MG-CLIP, a method that leverages the modality gap in CLIP to enhance continual learning by preserving pre-trained knowledge and adapting to new data, achieving superior performance without extra replay data.
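For readers unfamiliar with the term, the modality gap in CLIP-like models is commonly measured as the offset between the centroids of normalized image and text embeddings; the sketch below computes it, plus a hypothetical preservation regularizer that is my assumption rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Vector between the centroids of L2-normalized image and text embeddings.
    Tracking how this gap drifts during fine-tuning indicates how much pre-trained
    structure is preserved."""
    img_center = F.normalize(image_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(text_emb, dim=-1).mean(dim=0)
    return img_center - txt_center

def gap_preservation_loss(gap_current: torch.Tensor, gap_pretrained: torch.Tensor) -> torch.Tensor:
    """Hypothetical regularizer: penalize deviation from the frozen pre-trained model's gap."""
    return (gap_current - gap_pretrained).norm()
```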

Authors:Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
Title: RoHOI: Robustness Benchmark for Human-Object Interaction Detection
Abstract:
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at https://github.com/Kratos-Wen/RoHOI.
中文: 本研究提出了首个用于人-物交互检测的鲁棒性基准RoHOI,并设计了一种基于语义感知掩码的渐进学习策略,有效提升了模型在现实干扰下的稳健性,性能优于现有方法。
English: The study introduces RoHOI, the first robustness benchmark for Human-Object Interaction detection, and proposes a Semantic-Aware Masking-based Progressive Learning strategy that significantly enhances model resilience against real-world corruptions, outperforming existing methods.

Authors:Yiyang Chen, Shanshan Zhao, Lunhao Duan, Changxing Ding, Dacheng Tao
Title: Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
Abstract:
Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Code is publicly available at https://github.com/wdttt/PointSD.
中文摘要:PointSD框架利用预训练的Stable Diffusion模型,通过训练点云到图像的扩散模型将三维点云特征与语义图像特征对齐,从而增强三维自监督学习。
English Summary: PointSD is a framework that leverages the pre-trained Stable Diffusion model to enhance 3D self-supervised learning by training a point-to-image diffusion model that aligns 3D point cloud features with semantic image features.
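The final training stage aligns 3D-backbone features with Stable Diffusion features extracted under point-cloud conditioning; a minimal alignment loss using negative cosine similarity is sketched below as an assumed objective (the paper may use a different criterion).

```python
import torch
import torch.nn.functional as F

def sd_alignment_loss(backbone_feats: torch.Tensor, sd_feats: torch.Tensor) -> torch.Tensor:
    """Align 3D-backbone features with frozen Stable Diffusion features.

    backbone_feats: (B, D) pooled features from the 3D backbone (possibly after a projector)
    sd_feats:       (B, D) target features extracted from the frozen SD pipeline
    Returns the mean negative cosine similarity, so lower is better aligned.
    """
    return -F.cosine_similarity(backbone_feats, sd_feats, dim=-1).mean()
```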

Authors:Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins
Title: Taming generative video models for zero-shot optical flow extraction
Abstract:
Extracting optical flow from videos remains a core computer vision problem. Motivated by the success of large general-purpose models, we ask whether frozen self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output flow. Prior work reading out depth or illumination from video generators required fine-tuning, which is impractical for flow, where labels are scarce and synthetic datasets suffer from a sim-to-real gap. Inspired by the Counterfactual World Model (CWM) paradigm, which can obtain point-wise correspondences by injecting a small tracer perturbation into a next-frame predictor and tracking its propagation, we extend this idea to generative video models. We explore several popular architectures and find that successful zero-shot flow extraction in this manner is aided by three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These properties are uniquely present in the recent Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between perturbed and unperturbed predictive distributions. Without any flow-specific fine-tuning, our method outperforms state-of-the-art models on the real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and the synthetic TAP-Vid Kubric (4.7% relative improvement). Our results indicate that counterfactual prompting of controllable generative video models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality flow.
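
The KL-tracing readout is simple enough to sketch. Below is a minimal, illustrative Python sketch assuming a frozen next-frame predictor `model` that returns per-pixel categorical logits of shape (H, W, K); the function name, interface, and perturbation amplitude are assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def kl_trace_flow(model, frame0, y, x, amp=0.5):
    """Estimate where pixel (y, x) of frame0 moves in the next frame."""
    with torch.no_grad():
        clean_logits = model(frame0)                      # (H, W, K) logits
        perturbed = frame0.clone()
        perturbed[..., y, x] += amp                       # localized tracer
        pert_logits = model(perturbed)

    log_p = F.log_softmax(pert_logits, dim=-1)            # perturbed dist P
    log_q = F.log_softmax(clean_logits, dim=-1)           # clean dist Q
    kl_map = (log_p.exp() * (log_p - log_q)).sum(-1)      # per-pixel KL(P||Q)

    # The perturbation's influence concentrates where the pixel landed.
    ty, tx = divmod(int(kl_map.flatten().argmax()), kl_map.shape[1])
    return tx - x, ty - y                                 # flow vector (dx, dy)
```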

Authors:Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Déforges
Title: VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models
Abstract:
Recent years have witnessed remarkable progress in developing Vision-Language Models (VLMs) capable of processing both textual and visual inputs. These models have demonstrated impressive performance, leading to their widespread adoption in various applications. However, this widespread adoption raises serious concerns regarding user privacy, particularly when models inadvertently process or expose private visual information. In this work, we frame the preservation of privacy in VLMs as an adversarial attack problem. We propose a novel attack strategy that selectively conceals information within designated Regions of Interest (ROIs) in an image, effectively preventing VLMs from accessing sensitive content while preserving the semantic integrity of the remaining image. Unlike conventional adversarial attacks that often disrupt the entire image, our method maintains high coherence in unmasked areas. Experimental results across three state-of-the-art VLMs, namely LLaVA, Instruct-BLIP, and BLIP2-T5, demonstrate up to a 98% reduction in detecting targeted ROIs, while keeping global image semantics intact, as confirmed by high similarity scores between clean and adversarial outputs. We believe that this work contributes to a more privacy-conscious use of multimodal models and offers a practical tool for further research, with the source code publicly available at: https://github.com/hbrachemi/Vlm_defense-attack.
English: This study introduces a novel adversarial attack method that selectively conceals sensitive regions in images to protect user privacy from Vision-Language Models while maintaining overall image semantics, achieving up to a 98% reduction in targeted detection across three advanced VLMs.
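
The ROI-restricted attack can be approximated with a standard PGD loop whose update is masked to the region of interest, so pixels outside the ROI stay clean. The sketch below is illustrative only: `model.roi_concealment_loss` is a hypothetical handle for whatever objective drives the VLM away from recognizing the ROI, and the budget and step count are conventional defaults, not the paper's settings.

```python
import torch

def roi_pgd_attack(model, image, roi_mask, steps=40, eps=8/255, alpha=1/255):
    """PGD restricted to the ROI: pixels outside roi_mask remain untouched."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = model.roi_concealment_loss(adv)   # hypothetical objective handle
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign() * roi_mask    # update ROI only
            adv = image + (adv - image).clamp(-eps, eps)  # L_inf budget
            adv = adv.clamp(0, 1)
    return adv.detach()
```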

Authors:Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, Tommi Jaakkola
Title: Learning Diffusion Models with Flexible Representation Guidance
Abstract:
Diffusion models can be improved with additional guidance towards more effective representations of the input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arising from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as a four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.
English: This paper introduces a systematic framework to enhance diffusion models by incorporating representation guidance, which accelerates training and improves generation quality across various tasks, as demonstrated by 23.3 times faster training on ImageNet.
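
The underlying recipe, aligning a denoiser's internal features with a pretrained encoder's representation through a learnable projection head, can be sketched in a few lines. The layer choice, head, and cosine objective below are illustrative assumptions in the spirit of REPA-style alignment, not the paper's exact decompositions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(denoiser_feats, target_repr, proj_head):
    """Negative cosine alignment between projected denoiser features and a
    pretrained encoder's representation (a sketch under assumed shapes).
    denoiser_feats: (B, N, D1) from a chosen denoiser layer.
    target_repr:    (B, N, D2) from a frozen pretrained encoder.
    proj_head:      learnable nn.Module mapping D1 -> D2."""
    pred = proj_head(denoiser_feats)
    return 1.0 - F.cosine_similarity(pred, target_repr, dim=-1).mean()

# Typical use: total_loss = denoising_loss + lambda_align * alignment_loss(...)
```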

Authors:Mahdiyar Molahasani, Azadeh Motamedi, Michael Greenspan, Il-Min Kim, Ali Etemad
Title: PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection
Abstract:
We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets. We make our code public at: https://github.com/MahdiyarMM/PRISM.
English: PRISM is a novel data-free and task-agnostic method that mitigates bias in vision-language models by using an LLM to generate scene descriptions with spurious correlations and applying a contrastive debiasing loss to learn a projection that reduces these biases while maintaining image-text alignment.
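
A rough sketch of the second stage follows: a learnable projection over frozen embeddings, trained with a contrastive-style objective that pulls together same-class descriptions that differ in their spurious attribute. The class/bias labels, temperature, and exact loss form are assumptions for illustration; the paper's loss differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DebiasProjection(nn.Module):
    """Learnable projection applied on top of frozen CLIP embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, z):
        return F.normalize(self.proj(z), dim=-1)

def debias_contrastive_loss(proj, text_emb, class_ids, bias_ids, tau=0.07):
    """Pull together same-class descriptions with different spurious
    attributes, so the projected space ignores the spurious direction."""
    z = proj(text_emb)                                    # (B, D), unit norm
    sim = z @ z.t() / tau
    sim = sim - torch.eye(len(z), device=z.device) * 1e9  # mask self-pairs
    pos = ((class_ids[:, None] == class_ids[None, :]) &
           (bias_ids[:, None] != bias_ids[None, :])).float()
    log_prob = sim.log_softmax(dim=-1)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```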

Authors:Kun Jing, Luoyu Chen, Jungang Xu, Jianwei Tai, Yiyu Wang, Shuaimin Li
Title: Zero-Shot Neural Architecture Search with Weighted Response Correlation
Abstract:
Neural architecture search (NAS) is a promising approach for automatically designing neural network architectures. However, architecture estimation in NAS is computationally expensive and time-consuming because it requires training multiple architectures from scratch. Although existing zero-shot NAS methods use training-free proxies to accelerate architecture estimation, their effectiveness, stability, and generality are still lacking. We present a novel training-free estimation proxy called weighted response correlation (WRCor). WRCor utilizes correlation coefficient matrices of responses across different input samples to calculate the proxy scores of estimated architectures, which can measure their expressivity and generalizability. Experimental results on proxy evaluation demonstrate that WRCor and its voting proxies are more efficient estimation strategies than existing proxies. We also apply them with different search strategies in architecture search. Experimental results on architecture search show that our zero-shot NAS algorithm outperforms most existing NAS algorithms in different search spaces. Our NAS algorithm can discover an architecture with a 22.1% test error on the ImageNet-1k dataset within 4 GPU hours. All code is publicly available at https://github.com/kunjing96/ZSNAS-WRCor.git.
English: The paper introduces a novel training-free proxy called weighted response correlation (WRCor) for neural architecture search, which efficiently estimates architecture expressivity and generalizability, outperforming existing methods in both proxy evaluation and architecture search while achieving competitive results on ImageNet-1k within minimal GPU time.
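
To make the proxy concrete, the sketch below scores an architecture from the correlation-coefficient matrix of its responses across a small batch of inputs, rewarding decorrelated responses. The weighting is left out as a simplification; WRCor's actual weighting scheme is richer, so treat this as an illustrative baseline.

```python
import numpy as np

def response_correlation_score(responses):
    """responses: (N, D) activations of one architecture for N input samples.
    Score from the correlation-coefficient matrix of responses across samples;
    the paper's weighting is omitted here (illustrative simplification)."""
    r = responses - responses.mean(axis=1, keepdims=True)
    r /= np.linalg.norm(r, axis=1, keepdims=True) + 1e-8
    corr = r @ r.T                              # (N, N) Pearson correlations
    off_diag = ~np.eye(len(corr), dtype=bool)
    return -np.abs(corr[off_diag]).mean()       # decorrelated -> higher score
```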

Authors:Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang
Title: Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective
Abstract:
Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we verify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 adopts a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
English Summary: Lumos-1 is an autoregressive video generator that maintains the standard LLM architecture with minimal changes, incorporating MM-RoPE for spatiotemporal modeling and AR-DF to address frame-wise loss imbalance, achieving competitive performance with efficient training on just 48 GPUs.
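
The temporal tube masking used by AR-DF can be illustrated directly on a discrete token grid: one spatial mask pattern is sampled and shared across all frames, so the model cannot simply copy spatially redundant content from neighboring frames. Shapes, the mask ratio, and the `mask_id` argument below are assumptions, not Lumos-1's configuration.

```python
import torch

def temporal_tube_mask(tokens, mask_id, mask_ratio=0.5):
    """tokens: (T, H, W) grid of discrete video tokens.
    Sample one 2D mask and share it across all T frames ('temporal tubes'),
    so masked content cannot be copied from adjacent frames."""
    T, H, W = tokens.shape
    keep = torch.rand(H, W) > mask_ratio        # one pattern for all frames
    tube = keep.unsqueeze(0).expand(T, H, W)
    masked = tokens.clone()
    masked[~tube] = mask_id                     # assumed mask-token id
    return masked, tube
```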

Authors:Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu
Title: From One to More: Contextual Part Latents for 3D Generation
Abstract:
Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite progress, three key limitations persist: (1) Single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) Holistic latent coding neglects part independence and interrelationships critical for compositional design; (3) Global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) Reduces encoding complexity through part decomposition; ii) Enables explicit part relationship modeling; iii) Supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part latent denoising, ensuring both geometric coherence and foundation model priors. To enable large-scale training, we construct Partverse, a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.

Authors:Rei Tamaru, Pei Li, Bin Ran
Title: Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection
Abstract:
Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component of transportation-system DTs is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at https://github.com/raynbowy23/FedMeta-GeoLane.git.
English Summary: The Geo-ORBIT framework combines real-time lane detection, digital twin synchronization, and federated meta-learning to overcome scalability, privacy, and efficiency challenges in transportation digital twins, demonstrating superior performance through extensive experiments.
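
As a sketch of the federated piece, a Reptile-style round is shown below: each roadside site adapts a copy of the global model on local trajectory data, and only weight deltas return to the server, which is the privacy-preserving part. The loss and hyperparameters are illustrative assumptions, not FedMeta-GeoLane's exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

def fedmeta_round(global_model, site_loaders, inner_steps=5,
                  inner_lr=1e-3, meta_lr=0.5):
    """One Reptile-style round: each site adapts a local copy, and the server
    nudges the global weights toward the mean adapted weights."""
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for loader in site_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=inner_lr)
        for _, (x, y) in zip(range(inner_steps), loader):
            opt.zero_grad()
            F.mse_loss(local(x), y).backward()   # assumed lane-geometry loss
            opt.step()
        for d, p_loc, p_glb in zip(deltas, local.parameters(),
                                   global_model.parameters()):
            d += (p_loc.data - p_glb.data) / len(site_loaders)
    with torch.no_grad():
        for p, d in zip(global_model.parameters(), deltas):
            p += meta_lr * d                     # only weights travel, not data
    return global_model
```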

Authors:Tianlong Ai, Tianzhu Liu, Haochen Jiang, Yanfeng Gu
Title: HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer
Abstract:
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: https://github.com/AI-Tianlong/HieraRS.
English Summary: HieraRS is a hierarchical interpretation paradigm that addresses limitations of existing deep learning methods for land cover and land use classification, enabling multi-granularity predictions and cross-domain transfer to heterogeneous hierarchies through the BHCCM mechanism and the TransLU framework.

Authors:Yuqiang Lin, Sam Lockyer, Mingxuan Sui, Li Gan, Florian Stanek, Markus Zarbock, Wenbin Li, Adrian Evans, Nic Zhang
Title: RoundaboutHD: High-Resolution Real-World Urban Environment Benchmark for Multi-Camera Vehicle Tracking
Abstract:
The multi-camera vehicle tracking (MCVT) framework holds significant potential for smart city applications, including anomaly detection, traffic density estimation, and suspect vehicle tracking. However, current publicly available datasets exhibit limitations, such as overly simplistic scenarios, low-resolution footage, and insufficiently diverse conditions, creating a considerable gap between academic research and real-world scenarios. To fill this gap, we introduce RoundaboutHD, a comprehensive, high-resolution multi-camera vehicle tracking benchmark dataset specifically designed to represent real-world roundabout scenarios. RoundaboutHD provides a total of 40 minutes of labelled video footage captured by four non-overlapping, high-resolution (4K, 15 fps) cameras. In total, 512 unique vehicle identities are annotated across different camera views, offering rich cross-camera association data. RoundaboutHD offers temporally consistent video footage and enhanced challenges, including increased occlusions and nonlinear movement inside the roundabout. In addition to the full MCVT dataset, several subsets are also available for object detection, single-camera tracking, and image-based vehicle re-identification (ReID) tasks. Vehicle model information and camera modelling/geometry information are also included to support further analysis. We provide baseline results for vehicle detection, single-camera tracking, image-based vehicle re-identification, and multi-camera tracking. The dataset and the evaluation code are publicly available at: https://github.com/siri-rouser/RoundaboutHD.git
English: The RoundaboutHD dataset addresses the limitations of existing multi-camera vehicle tracking benchmarks by providing high-resolution, temporally consistent footage from a real-world roundabout, complete with comprehensive annotations and baseline results for various tracking tasks.

Authors:Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu
Title: Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine
Abstract:
Scaling laws have achieved success in LLMs and foundation models. To explore their potential in integrated sensing and communication (ISAC) research, we propose Great-X, a single-engine multimodal data twin platform that reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset will be made available at: https://github.com/hkw-xg/Great-MCD.
English: Great-X is a multimodal data twin platform that integrates ray-tracing simulation with autonomous driving tools to efficiently generate synchronized multimodal data; the derived Great-MSD dataset and CSI-based UAV localization baseline demonstrate feasibility and generalizability across CSI simulation engines.

Authors:Shuang Cui, Jinglin Xu, Yi Li, Xiongxin Tang, Jiangmeng Li, Jiahuan Zhou, Fanjiang Xu, Fuchun Sun, Hui Xiong
Title: BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis
Abstract:
Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under temporally evolving distribution shifts common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as Continual-Temporal Test-Time Adaptation (CT-TTA), where test distributions evolve gradually over time. To address it, we propose BayesTTA, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at https://github.com/cuishuang99/BayesTTA.
English: Vision-language models struggle with gradual real-world distribution shifts, so we propose BayesTTA, a Bayesian framework that enforces temporal consistency and dynamic representation alignment, significantly outperforming existing methods.
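
The GDA inference step admits a compact sketch: given running class means, a covariance estimate, and class priors, calibrated posteriors follow from Mahalanobis distances. The shared-covariance choice below is just one of the structures that hypothesis testing could select; shapes and the regularization constant are assumptions.

```python
import torch

def gda_posteriors(feats, means, cov, priors):
    """feats: (N, D) test features; means: (C, D) running class means;
    cov: (D, D) shared covariance; priors: (C,) class priors.
    Returns calibrated (N, C) class posteriors."""
    eye = torch.eye(cov.shape[0], device=cov.device)
    prec = torch.linalg.inv(cov + 1e-4 * eye)             # regularized precision
    diff = feats[:, None, :] - means[None, :, :]          # (N, C, D)
    mahal = torch.einsum('ncd,de,nce->nc', diff, prec, diff)
    log_post = -0.5 * mahal + priors.log()                # shared terms cancel
    return log_post.softmax(dim=-1)
```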

Authors:Junyu Chen, Yihua Gao, Mingyong Li
Title: Visual Semantic Description Generation with MLLMs for Image-Text Matching
Abstract:
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features versus discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment, which clusters VSD to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at https://github.com/Image-Text-Matching/VSD.
English Summary: This study introduces a novel framework that uses multimodal large language models as visual semantic parsers to bridge the modality gap in image-text matching, achieving significant performance improvements on standard benchmarks and strong zero-shot generalization across domains.

Authors:Enyu Liu, En Yu, Sijia Chen, Wenbing Tao
Title: Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
Abstract:
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test. The code is available at https://github.com/Enyu-Liu/DISC.
English: The proposed DISC method introduces a dual-stream paradigm using class queries to enhance 3D semantic scene completion by separating instance and scene optimization, achieving state-of-the-art results on major benchmarks with significant improvements on instance categories.

Authors:Inye Na, Nejung Rue, Jiwon Chung, Hyunjin Park
Title: RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features
Abstract:
Medical image retrieval is a valuable field for supporting clinical decision-making, yet current methods primarily support 2D images and require fully annotated queries, limiting clinical flexibility. To address this, we propose RadiomicsRetrieval, a 3D content-based retrieval framework bridging handcrafted radiomics descriptors with deep learning-based embeddings at the tumor level. Unlike existing 2D approaches, RadiomicsRetrieval fully exploits volumetric data to leverage richer spatial context in medical images. We employ a promptable segmentation model (e.g., SAM) to derive tumor-specific image embeddings, which are aligned with radiomics features extracted from the same tumor via contrastive learning. These representations are further enriched by anatomical positional embedding (APE). As a result, RadiomicsRetrieval enables flexible querying based on shape, location, or partial feature sets. Extensive experiments on both lung CT and brain MRI public datasets demonstrate that radiomics features significantly enhance retrieval specificity, while APE provides global anatomical context essential for location-based searches. Notably, our framework requires only minimal user prompts (e.g., a single point), minimizing segmentation overhead and supporting diverse clinical scenarios. The capability to query using either image embeddings or selected radiomics attributes highlights its adaptability, potentially benefiting diagnosis, treatment planning, and research on large-scale medical imaging repositories. Our code is available at https://github.com/nainye/RadiomicsRetrieval.
English Summary: RadiomicsRetrieval introduces a 3D medical image retrieval framework that combines radiomics features with deep learning embeddings through contrastive learning, enabling flexible tumor-level queries from minimal prompts while outperforming existing 2D methods across lung and brain imaging datasets.

Authors:Heng Li, Qingcai Chen, Xiangping Wu
Title: Dual Dimensions Geometric Representation Learning Based Document Dewarping
Abstract:
Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose D2Dewarp, a fine-grained deformation perception model that attends to dual dimensions of document lines, horizontal and vertical, to improve dewarping. It can perceive distortion trends in different directions across document details. To combine horizontal and vertical granularity features, an effective fusion module based on X and Y coordinates is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Because current public dewarping datasets lack annotated line features, we also propose an automatic fine-grained annotation method that uses public document texture images and an automatic rendering engine to build a new large-scale distortion training dataset. On public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification than state-of-the-art methods. The code and dataset will be publicly available at https://github.com/xiaomore/DocDewarpHV
English: This paper introduces D2Dewarp, a dual-dimensional document dewarping model that leverages horizontal and vertical line features through a coordinate-based fusion module, achieving superior rectification results on benchmarks with the help of a novel automatically annotated dataset.

Authors:Zesong Yang, Bangbang Yang, Wenqi Dong, Chenxuan Cao, Liyuan Cui, Yuewen Ma, Zhaopeng Cui, Hujun Bao
Title: InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
Abstract:
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting similar cognitive abilities to robots remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness arising from limited observations, we introduce in-situ generation, which harnesses valuable observations and geometric cues to guide 3D generative models in reconstructing complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
English: This paper introduces InstaScene, a new paradigm that decomposes cluttered scenes into individual instances and reconstructs them completely, using spatial contrastive learning and in-situ generation to achieve superior decomposition accuracy and geometrically faithful, visually intact objects.

Authors:Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai
Title: Subject-Consistent and Pose-Diverse Text-to-Image Generation
Abstract:
Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
English Summary: The CoDi framework overcomes the trade-off between subject consistency and pose diversity in text-to-image generation through a two-stage diffusion strategy that transports identity features early and refines details later, improving both visual quality and quantitative metrics.

Authors:Shishuai Hu, Zehui Liao, Liangli Zhen, Huazhu Fu, Yong Xia
Title: Cycle Context Verification for In-Context Medical Image Segmentation
Abstract:
In-context learning (ICL) is emerging as a promising technique for achieving universal medical image segmentation, where a variety of objects of interest across imaging modalities can be segmented using a single model. Nevertheless, its performance is highly sensitive to the alignment between the query image and in-context image-mask pairs. In a clinical scenario, the scarcity of annotated medical images makes it challenging to select optimal in-context pairs, and fine-tuning foundation ICL models on contextual data is infeasible due to computational costs and the risk of catastrophic forgetting. To address this challenge, we propose Cycle Context Verification (CCV), a novel framework that enhances ICL-based medical image segmentation by enabling self-verification of predictions and accordingly enhancing contextual alignment. Specifically, CCV employs a cyclic pipeline in which the model initially generates a segmentation mask for the query image. Subsequently, the roles of the query and an in-context pair are swapped, allowing the model to validate its prediction by predicting the mask of the original in-context image. The accuracy of this secondary prediction serves as an implicit measure of the initial query segmentation. A query-specific prompt is introduced to alter the query image and updated to improve the measure, thereby enhancing the alignment between the query and in-context pairs. We evaluated CCV on seven medical image segmentation datasets using two ICL foundation models, demonstrating its superiority over existing methods. Our results highlight CCV's ability to enhance ICL-based segmentation, making it a robust solution for universal medical image segmentation. The code will be available at https://github.com/ShishuaiHu/CCV.
English: The proposed Cycle Context Verification (CCV) framework enhances in-context learning for universal medical image segmentation by enabling self-verification of predictions through a cyclic role-swapping mechanism, improving contextual alignment without fine-tuning and demonstrating superior performance across multiple datasets.

Authors:Jihao Gu, Fei Wang, Kun Li, Yanyan Wei, Zhiliang Wu, Dan Guo
Title: MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion
Abstract:
In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%. Code is available at: https://github.com/momiji-bit/MM-Gesture.
English Summary: Our team's MM-Gesture multimodal fusion framework achieved first place in the IJCAI 2025 micro-gesture classification challenge by integrating multiple visual modalities, reaching 73.213% top-1 accuracy on the iMiGUE benchmark.

Authors:Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Title: Single Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement
Abstract:
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
English: This study introduces a cross-cancer single domain generalization task for multimodal survival prediction and proposes two plug-and-play modules, SDIR and CADE, that enhance model robustness by rebalancing modality contributions and synthesizing target-domain distributions, validated by superior performance on a four-cancer benchmark.

Authors:Kui Jiang, Shiyu Liu, Junjun Jiang, Hongxun Yao, Xiaopeng Fan
Title: M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation
Abstract:
Audio-driven talking head generation holds significant potential for film production. While existing 3D methods have advanced motion modeling and content synthesis, they often produce rendering artifacts, such as motion blur, temporal jitter, and local penetration, due to limitations in representing stable, fine-grained motion fields. Through systematic analysis, we reformulate talking head generation into a unified framework comprising three steps: video preprocessing, motion representation, and rendering reconstruction. This framework underpins our proposed M2DAO-Talker, which addresses current limitations via multi-granular motion decoupling and alternating optimization. Specifically, we devise a novel 2D portrait preprocessing pipeline to extract frame-wise deformation control conditions (motion region segmentation masks, and camera parameters) to facilitate motion representation. To ameliorate motion modeling, we elaborate a multi-granular motion decoupling strategy, which independently models non-rigid (oral and facial) and rigid (head) motions for improved reconstruction accuracy. Meanwhile, a motion consistency constraint is developed to ensure head-torso kinematic consistency, thereby mitigating penetration artifacts caused by motion aliasing. In addition, an alternating optimization strategy is designed to iteratively refine facial and oral motion parameters, enabling more realistic video generation. Experiments across multiple datasets show that M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness versus TalkingGaussian, while running at 150 FPS inference speed. Our project homepage is https://m2dao-talker.github.io/M2DAO-Talk.github.io.
English Summary: M2DAO-Talker is a unified framework for audio-driven talking-head generation that addresses rendering artifacts through multi-granular motion decoupling and alternating optimization, achieving state-of-the-art generation quality and realness across multiple datasets.

Authors:J. D. Peiffer, Kunal Shah, Irina Djuraskovic, Shawana Anarwala, Kayan Abdou, Rujvee Patel, Prakash Jayabalan, Brenton Pennicooke, R. James Cotton
Title: Portable Biomechanics Laboratory: Clinically Accessible Movement Analysis from a Handheld Smartphone
Abstract:
The way a person moves is a direct reflection of their neurological and musculoskeletal health, yet it remains one of the most underutilized vital signs in clinical practice. Although clinicians visually observe movement impairments, they lack accessible and validated methods to objectively measure movement in routine care. This gap prevents wider use of biomechanical measurements in practice, which could enable more sensitive outcome measures or earlier identification of impairment. We present our Portable Biomechanics Laboratory (PBL), which includes a secure, cloud-enabled smartphone app for data collection and a novel algorithm for fitting biomechanical models to this data. We extensively validated PBL's biomechanical measures using a large, clinically representative dataset. Next, we tested the usability and utility of our system in neurosurgery and sports medicine clinics. We found joint angle errors within 3 degrees across participants with neurological injury, lower-limb prosthesis users, pediatric inpatients, and controls. In addition to being easy to use, gait metrics computed from the PBL showed high reliability and were sensitive to clinical differences. For example, in individuals undergoing decompression surgery for cervical myelopathy, the mJOA score is a common patient-reported outcome measure; we found that PBL gait metrics correlated with mJOA scores and demonstrated greater responsiveness to surgical intervention than the patient-reported outcomes. These findings support the use of handheld smartphone video as a scalable, low-burden tool for capturing clinically meaningful biomechanical data, offering a promising path toward accessible monitoring of mobility impairments. We release the first clinically validated method for measuring whole-body kinematics from handheld smartphone video at https://intelligentsensingandrehabilitation.github.io/MonocularBiomechanics/.

Authors:Jason Kahei Tam, Murilo Gustineli, Anthony Miyaguchi
Title: Transfer Learning and Mixup for Fine-Grained Few-Shot Fungi Classification
Abstract:
Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-competition evaluation, suggesting that additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at https://github.com/dsgt-arc/fungiclef-2025.
English Summary: This paper details DS@GT's FungiCLEF 2025 competition approach using vision transformers with data augmentation and textual information, achieving improved performance through domain-specific pretraining while identifying potential enhancements in metadata selection and multimodal learning.

Authors:Xiwen Chen, Peijie Qiu, Wenhui Zhu, Hao Wang, Huayu Li, Xuanzhao Dong, Xiaotong Sun, Xiaobing Yu, Yalin Wang, Abolfazl Razi, Aristeidis Sotiras
Title: Cracking Instance Jigsaw Puzzles: An Alternative to Multiple Instance Learning for Whole Slide Image Analysis
Abstract:
While multiple instance learning (MIL) has been shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task as cracking an instance jigsaw puzzle problem, where semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where the proposed method outperforms the recent state-of-the-art MIL competitors. The code is available at https://github.com/xiwenc1/MIL-JigsawPuzzles.
English: This study introduces a novel alternative to multiple instance learning for histopathological whole slide image analysis that solves instance jigsaw puzzles to better capture spatial correlations, outperforming state-of-the-art MIL methods in classification and survival prediction.
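
A minimal version of the pretext task is sketched below: shuffle the bag of instance embeddings, then train a head to recover each instance's original position with cross-entropy over positions. The Siamese architecture and the optimal-transport justification are not shown; names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_jigsaw_batch(instance_feats):
    """instance_feats: (N, D) instance embeddings of one WSI in raster order.
    Returns the shuffled bag and the permutation the model must recover."""
    perm = torch.randperm(instance_feats.shape[0])
    return instance_feats[perm], perm

def permutation_loss(position_logits, perm):
    """position_logits: (N, N); row i scores shuffled instance i against
    each original raster position. Cross-entropy against true positions."""
    return F.cross_entropy(position_logits, perm)
```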

Authors:Chong Cheng, Yu Hu, Sicheng Yu, Beizhen Zhao, Zijian Wang, Hao Wang
Title: RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
Abstract:
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: https://3dagentworld.github.io/reggs/.
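
The alignment metric's computational core is entropy-regularized optimal transport between GMM components, solvable with standard Sinkhorn iterations. In the sketch below, `cost` is assumed to hold pairwise squared 2-Wasserstein distances between Gaussian components; the Sim(3) search and the joint photometric/depth terms of RegGS are omitted.

```python
import torch

def sinkhorn_distance(cost, a, b, eps=0.05, iters=200):
    """Entropy-regularized OT between component weights a (M,) and b (K,)
    of two GMMs, given a pairwise cost matrix (M, K). Returns <plan, cost>."""
    K_gibbs = torch.exp(-cost / eps)            # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):                      # alternating scaling updates
        v = b / (K_gibbs.t() @ u)
        u = a / (K_gibbs @ v)
    plan = u[:, None] * K_gibbs * v[None, :]    # transport plan
    return (plan * cost).sum()                  # approximate MW2-style distance
```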

Authors:Evgenii Rudakov, Jonathan Shock, Otto Lappi, Benjamin Ultan Cowley
Title: SSSUMO: Real-Time Semi-Supervised Submovement Decomposition
Abstract:
This paper introduces SSSUMO, a semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation, due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real-time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model's effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at https://github.com/dolphin-in-a-coma/sssumo.
English: This paper presents SSSUMO, a semi-supervised deep learning method for submovement decomposition that achieves state-of-the-art accuracy with real-time processing, enabling new applications in human-computer interaction and motor control studies.
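
The synthetic data generation starts from the closed-form minimum-jerk velocity profile; a training trace is then a sum of overlapping submovements plus noise. The sketch below implements the standard profile, with onsets, durations, and amplitudes chosen arbitrarily for illustration.

```python
import numpy as np

def min_jerk_velocity(t, t0, D, A):
    """Velocity of a minimum-jerk submovement with onset t0, duration D,
    amplitude A: v(tau) = (A/D) * 30 * tau^2 * (1 - tau)^2, tau in [0, 1]."""
    tau = np.clip((t - t0) / D, 0.0, 1.0)
    return (A / D) * 30.0 * tau**2 * (1.0 - tau)**2

# A synthetic training trace: overlapping submovements plus observation noise.
t = np.linspace(0.0, 2.0, 2000)
v = (min_jerk_velocity(t, 0.10, 0.50, 1.0)
     + min_jerk_velocity(t, 0.45, 0.40, 0.6)
     + 0.01 * np.random.randn(t.size))
```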

Authors:Helen Qu, Sang Michael Xie
Title: Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models
Abstract:
CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impacts CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
English: This study reveals that the accuracy of CLIP and large multimodal models is strongly influenced by how often concepts co-occur in the training data, as measured by pointwise mutual information, affecting performance even on common objects when they appear in unusual pairings.

Authors:Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Abstract:
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs; even the most advanced models struggle with this benchmark, with none reaching 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
English Summary: TreeBench is a diagnostic benchmark designed to evaluate visual grounded reasoning by focusing on traceable evidence and complex reasoning, revealing that even advanced models like OpenAI-o3 struggle with its challenging tasks, while the proposed TreeVGR training paradigm significantly improves performance by integrating localization and reasoning.
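Since TreeBench scores traceability through bounding-box agreement, an IoU-based check conveys the flavor of such an evaluation. The matching rule and the 0.5 threshold below are assumptions for illustration, not the benchmark's exact protocol:

```python
def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def evidence_recall(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of ground-truth evidence boxes matched by some prediction."""
    hits = sum(any(box_iou(p, g) >= thresh for p in pred_boxes) for g in gt_boxes)
    return hits / max(len(gt_boxes), 1)
```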

Authors:Mingkai Jia, Wei Yin, Xiaotao Hu, Jiaxin Guo, Xiaoyang Guo, Qian Zhang, Xiao-Xiao Long, Ping Tan
Title: MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization
Abstract:
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, a large gap still exists between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform it on ImageNet significantly, with rFID 0.49 vs. 0.91, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
English: MGVQ enhances VQ-VAE performance by preserving latent dimensions and using multiple sub-codebooks to reduce information loss, achieving state-of-the-art reconstruction quality on ImageNet and zero-shot benchmarks.
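The core idea, keeping the full latent dimension while quantizing channel groups against separate sub-codebooks, can be sketched in a few lines. The shapes and the nearest-neighbor lookup below are illustrative assumptions, not the released implementation:

```python
import torch

def multi_group_quantize(z, codebooks):
    """Quantize z (B, C, H, W) by splitting channels into G groups, each with
    its own sub-codebook of shape (K, C // G)."""
    quantized, indices = [], []
    for g, codebook in zip(torch.chunk(z, len(codebooks), dim=1), codebooks):
        flat = g.permute(0, 2, 3, 1).reshape(-1, codebook.shape[1])  # (BHW, C/G)
        idx = torch.cdist(flat, codebook).argmin(dim=1)              # nearest code
        q = codebook[idx].reshape(g.shape[0], g.shape[2], g.shape[3], -1)
        quantized.append(q.permute(0, 3, 1, 2))
        indices.append(idx)
    return torch.cat(quantized, dim=1), indices

# 8 groups of 4 channels, each with its own 512-entry sub-codebook.
z = torch.randn(2, 32, 16, 16)
codebooks = [torch.randn(512, 4) for _ in range(8)]
zq, ids = multi_group_quantize(z, codebooks)
```

Because each group only has to cover a C/G-dimensional subspace, the effective codebook capacity grows multiplicatively with the number of groups while each lookup stays small.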

Authors:Shivam Duggal, Sanghyun Byun, William T. Freeman, Antonio Torralba, Phillip Isola
Title: Single-pass Adaptive Image Tokenization for Minimum Program Search
Abstract:
According to Algorithmic Information Theory (AIT), intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization, and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity -- revealing alignment with human intuition.
English: Inspired by Algorithmic Information Theory, KARL is a single-pass adaptive tokenizer that dynamically predicts the optimal number of tokens for images based on their approximate Kolmogorov complexity, matching the performance of multi-pass methods while aligning predicted complexity with human intuition.
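The quality-conditioned halting mechanism can be illustrated with a small sketch. Everything here (layer sizes, the 0.5 halting threshold, the name TokenHaltingHead) is a hypothetical reading of the abstract, not KARL's actual architecture:

```python
import torch
import torch.nn as nn

class TokenHaltingHead(nn.Module):
    """Predict how many tokens to keep in one forward pass, conditioned on a
    desired reconstruction quality (the Upside-Down-RL-style 'command')."""
    def __init__(self, dim=256, max_tokens=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim + 1, dim), nn.GELU(), nn.Linear(dim, max_tokens))

    def forward(self, image_emb, target_quality):
        # image_emb: (B, dim); target_quality: (B,) scalar command per image
        x = torch.cat([image_emb, target_quality.unsqueeze(-1)], dim=-1)
        halt = torch.sigmoid(self.head(x))                     # (B, max_tokens)
        # Token budget = length of the prefix before halting first fires.
        keep = (halt < 0.5).float().cumprod(dim=-1).sum(dim=-1).long() + 1
        return keep.clamp(max=halt.shape[-1])
```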

Authors:Weihao Xia, Cengiz Oztireli
Title: Multigranular Evaluation for Brain Visual Decoding
Abstract:
Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.

Authors:JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang
Title: OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
Abstract:
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
English: This paper introduces OST-Bench, a novel benchmark for evaluating multimodal language models' online spatio-temporal reasoning abilities through active scene exploration, revealing their limitations in handling dynamic memory and complex spatial tasks despite current advancements.

Authors:Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
Title: Scaling RL to Long Videos
Abstract:
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video with configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
English: This paper introduces a full-stack framework that enhances vision-language models' reasoning for long videos through reinforcement learning, featuring a specialized dataset, two-stage training pipeline, and efficient infrastructure, achieving superior performance and up to 2.1x training speedup.

Authors:Sizhen Bian, Mengxi Liu, Vitor Fortes Rey, Daniel Geissler, Paul Lukowicz
Title: TinierHAR: Towards Ultra-Lightweight Deep Learning Models for Efficient Human Activity Recognition on Edge Devices
Abstract:
Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining comparable average F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across the proposed TinierHAR, the prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. Finally, we discuss these findings and suggest principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking (https://github.com/zhaxidele/TinierHAR).
English: This paper presents TinierHAR, an ultra-lightweight deep learning model that achieves state-of-the-art computational efficiency while maintaining performance through innovative integration of spatial-temporal components, validated across multiple datasets.
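The ingredients named in the abstract (residual depthwise separable convolutions, a GRU, and temporal aggregation) compose naturally into a compact PyTorch sketch; layer sizes here are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

class DWSeparableBlock(nn.Module):
    """Residual depthwise-separable 1D conv block over (B, C, T) windows."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.norm(self.pointwise(self.depthwise(x))) + x)

class TinyHARNet(nn.Module):
    """Sketch of the conv -> GRU -> temporal aggregation pipeline."""
    def __init__(self, in_channels, num_classes, hidden=64):
        super().__init__()
        self.stem = nn.Conv1d(in_channels, hidden, 1)
        self.blocks = nn.Sequential(DWSeparableBlock(hidden), DWSeparableBlock(hidden))
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                               # x: (B, C, T) sensor data
        h = self.blocks(self.stem(x)).transpose(1, 2)   # (B, T, hidden)
        out, _ = self.gru(h)
        return self.head(out.mean(dim=1))               # mean-pool over time
```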

Authors:Jinhong Wang, Tajamul Ashraf, Zongyan Han, Jorma Laaksonen, Rao Mohammad Anwer
Title: MIRA: A Novel Framework for Fusing Modalities in Medical RAG
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.
English: The MIRA framework addresses factual inaccuracies in Multimodal Large Language Models by dynamically optimizing retrieval processes and integrating multimodal medical knowledge, achieving state-of-the-art diagnostic accuracy.

Authors:Jiayi Wu, Tianfu Wang, Md Abu Bakr Siddique, Md Jahidul Islam, Cornelia Fermuller, Yiannis Aloimonos, Christopher A. Metzler
Title: Single-Step Latent Diffusion for Underwater Image Restoration
Abstract:
Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models -- which encode strong priors on the geometry and depth of scenes -- with an explicit scene decomposition -- which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~ 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website https://tianfwang.github.io/slurpp/.
English: This paper introduces SLURPP, a novel underwater image restoration method that combines latent diffusion models with explicit scene decomposition to effectively address complex underwater degradation, achieving significant speed and quality improvements over existing approaches.
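A physics-based degradation pipeline of this kind typically builds on the standard underwater image formation model, I_c = J_c * exp(-beta_c * d) + B_c * (1 - exp(-beta_c * d)), with per-channel attenuation beta and veiling light B. The sketch below applies that model with illustrative coefficients; the paper's pipeline is more elaborate:

```python
import numpy as np

def underwater_degrade(img, depth, beta=(0.8, 0.4, 0.1), backscatter=(0.1, 0.3, 0.5)):
    """Apply attenuation plus backscattering to a terrestrial RGB image.

    img: (H, W, 3) uint8 image; depth: (H, W) scene depth in meters.
    beta attenuates red most (typical in water); coefficients are illustrative.
    """
    img = img.astype(np.float32) / 255.0
    t = np.exp(-np.asarray(beta) * depth[..., None])        # per-channel transmission
    degraded = img * t + np.asarray(backscatter) * (1.0 - t)
    return (degraded * 255).clip(0, 255).astype(np.uint8)
```

Pairing each synthesized image with its known transmission and veiling-light maps is what gives dense medium/degradation annotations for free.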

Authors:Pierre Marza, Leo Fillioux, Sofiène Boutaj, Kunal Mahatha, Christian Desrosiers, Pablo Piantanida, Jose Dolz, Stergios Christodoulidis, Maria Vakalopoulou
Title: THUNDER: Tile-level Histopathology image UNDERstanding benchmark
Abstract:
Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation models, as well as local user-defined models, for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.
English: THUNDER is a dynamic benchmark for digital pathology that enables efficient comparison of foundation models across diverse datasets, assessing performance, feature spaces, robustness, and uncertainty to advance reliable model evaluation in healthcare.

Authors:Yuchen Zhu, Cheng Shi, Dingyou Wang, Jiajin Tang, Zhengxuan Wei, Yu Wu, Guanbin Li, Sibei Yang
Title: Rethinking Query-based Transformer for Continual Image Segmentation
Abstract:
Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.
English: This paper introduces SimCIS, a novel framework for class-incremental image segmentation that preserves objectness through direct feature alignment and enhances plasticity with cross-stage consistency and a visual query replay mechanism, outperforming existing methods across various settings.

Authors:Jiaxin Huang, Ziwen Li, Hanlve Zhang, Runnan Chen, Xiao He, Yandong Guo, Wenping Wang, Tongliang Liu, Mingming Gong
Title: SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes
Abstract:
The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. Surprise3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found at https://github.com/liziwennba/SUPRISE.
English: The Surprise3D dataset addresses the gap in 3D vision-language research by providing over 200k vision-language pairs with human-annotated spatial queries that exclude object names, challenging current models to genuinely interpret spatial relationships and advancing spatially aware AI for embodied interaction.

Authors:Mélanie Roschewitz, Raghav Mehta, Fabio de Sousa Ribeiro, Ben Glocker
Title: Where are we with calibration under dataset shift in image classification?
Abstract:
We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
English Summary: This comprehensive study on image classification calibration under real-world dataset shift reveals that combining entropy regularization with label smoothing produces the best-calibrated probabilities, while post-hoc methods using out-of-distribution data show superior robustness, with ensemble methods and foundation model fine-tuning delivering optimal calibration performance.
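Two of the building blocks discussed here, ECE for measuring calibration and temperature scaling as a simple post-hoc calibrator, are standard and easy to state precisely. The grid-search fit below is a minimal sketch rather than any specific method from the study:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard ECE: bin predictions by confidence and compare per-bin
    accuracy with mean confidence."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Post-hoc temperature scaling by grid search on held-out NLL."""
    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)                # numerical stability
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()
    return min(grid, key=nll)
```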

Authors:Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein, Iyiola E. Olatunji, Daniel Kudenko
Title: SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
Abstract:
Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
English: Unrestricted adversarial attacks bypass traditional defenses by altering features like color, but lack guarantees of human imperceptibility, leading to the development of SCOOTER, a framework that evaluates such attacks through large-scale human studies and reveals a misalignment between automated vision systems and human perception.

Authors:David Pujol-Perich, Sergio Escalera, Albert Clapés
Title: Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
Abstract:
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST methods approach this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, demonstrating its substantial impact on performance. Overall, our method significantly improves on existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.
English: The proposed Sparse-Dense Side-Tuner (SDST) introduces an anchor-free side-tuning architecture with novel attention mechanisms that significantly enhance video temporal grounding performance while drastically reducing parameter requirements.

Authors:Ethan Dack, Chengliang Dai
Title: Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays
Abstract:
Recent works have revisited the infamous task "Name That Dataset", demonstrating that non-medical datasets contain underlying biases and that the dataset origin task can be solved with high accuracy. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release for open-source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. To extend our work, we apply simple transformations to the datasets, repeat the same task, and perform an analysis to identify and explain any detected biases. Given the importance of AI applications in medical imaging, it is vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR, and PadChest. We hope this work will encourage more explainable research in medical imaging and the creation of more open-source datasets in the medical domain. Our code can be found here: https://github.com/eedack01/x_ray_ds_bias.
English: This study investigates dataset bias in popular open-source chest X-ray datasets by applying the "Name That Dataset" task and transformations, aiming to determine if AI models rely on shortcuts rather than pathology, using multiple network architectures on NIH, CheXpert, MIMIC-CXR, and PadChest datasets.

Authors:Wei Shang, Dongwei Ren, Wanying Zhang, Pengfei Zhu, Qinghua Hu, Wangmeng Zuo
Title: Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring
Abstract:
Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.
English: This paper introduces an efficient deblurring method that uses a trainable mask predictor to identify blurred regions and an intra-frame motion analyzer to guide restoration, achieving superior performance while reducing computational costs by 49%.
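The pixel-level pruning idea rests on the fact that a 1x1 convolution acts independently on each spatial position, so sharp pixels can be skipped with a gather/scatter. A minimal sketch, with a hypothetical mask and a linear layer standing in for the reparameterized 1x1 convolution:

```python
import torch
import torch.nn as nn

def prune_and_process(feat, blur_mask, pointwise):
    """Run a pointwise (1x1) layer only on pixels flagged as blurred and pass
    sharp pixels through unchanged. On (N, C) vectors a 1x1 convolution is
    just a linear map, which is what enables this gather/scatter trick."""
    b, c, h, w = feat.shape
    flat = feat.permute(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C)
    idx = blur_mask.reshape(-1).bool()
    out = flat.clone()
    out[idx] = pointwise(flat[idx])                  # compute only where blurred
    return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

feat = torch.randn(1, 8, 4, 4)
mask = torch.rand(1, 1, 4, 4) > 0.5                  # hypothetical predicted mask
y = prune_and_process(feat, mask, nn.Linear(8, 8))
```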

Authors:Peixian Zhuang, Yijian Wang, Zhenqi Fu, Hongliang Zhang, Sam Kwong, Chongyi Li
Title: Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation
Abstract:
Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images caused by light absorption and scattering effects in marine environments. Recently, Mamba-based methods have achieved promising performance across various vision tasks; however, they struggle with the UMDE task because their inflexible state scanning strategies fail to model the structural features of underwater images effectively. Meanwhile, existing UMDE datasets usually contain unreliable depth labels, leading to incorrect object-depth relationships between underwater images and their corresponding depth maps. To overcome these limitations, we develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. Specifically, we propose a tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity. The spatial topological features among the tree nodes are then flexibly aggregated through bottom-up and top-down traversals, enabling stronger multi-scale feature representation capabilities. Moreover, we construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels. This benchmark serves as a foundational dataset for training existing deep learning-based UMDE methods to learn accurate object-depth relationships. Extensive experiments demonstrate the superiority of the proposed Tree-Mamba over several leading methods in both qualitative results and quantitative evaluations with competitive computational efficiency. Code and dataset will be available at https://wyjgr.github.io/Tree-Mamba.html.
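The tree-aware scan starts from a minimum spanning tree over feature similarity, which is easy to prototype with SciPy. This sketches only that first step; the paper's bottom-up/top-down aggregation is not shown:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def feature_mst_edges(features):
    """Build an MST over (N, D) patch features so that traversal order follows
    feature similarity instead of a fixed raster scan."""
    dist = cdist(features, features) + 1e-9   # strictly positive off-diagonal
    np.fill_diagonal(dist, 0.0)               # zeros are treated as "no edge"
    mst = minimum_spanning_tree(dist)
    rows, cols = mst.nonzero()
    return list(zip(rows.tolist(), cols.tolist()))

edges = feature_mst_edges(np.random.rand(64, 16))   # 63 tree edges for 64 nodes
```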

Authors:Feng Liu, Lingna Gu, Chen Shi, Xiaolan Fu
Title: Action Unit Enhance Dynamic Facial Expression Recognition
Abstract:
Dynamic Facial Expression Recognition (DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of Action Units (AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, the knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of an AU loss. The design is incorporated into the existing optimal model for dynamic expression recognition for the purpose of validation. Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the state-of-the-art (SOTA) methods without additional computational cost and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, or minor-class, problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at https://github.com/Cross-Innovation-Lab/AU-DFER.
English Summary: The AU-DFER architecture enhances dynamic facial expression recognition by integrating quantified Action Unit knowledge with deep learning through a designed weight matrix and AU loss, demonstrating superior performance over existing methods while addressing data imbalance issues.

Authors:Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez, Paul Couairon, Clément Rambour, Raphaël Fournier-Sniehotta, Ismail Ben Ayed, Jose Dolz, Nicolas Thome
Title: ViLU: Learning Vision-Language Uncertainties for Failure Prediction
Abstract:
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
English: ViLU is a novel vision-language uncertainty quantification framework that improves failure prediction by integrating multimodal representations and training an uncertainty predictor as a loss-agnostic binary classifier, demonstrating superior performance across diverse datasets.
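The loss-agnostic training recipe, a binary correct-vs-incorrect classifier under a weighted BCE, is straightforward to sketch. The plain MLP below stands in for ViLU's cross-attention fusion and is an illustrative simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FailurePredictor(nn.Module):
    """Score an (image, text) embedding pair for likely misclassification."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_emb, txt_emb):
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

def weighted_bce(logits, is_error, pos_weight=3.0):
    # Upweight the rarer "incorrect prediction" class; 3.0 is illustrative.
    return F.binary_cross_entropy_with_logits(
        logits, is_error.float(), pos_weight=torch.tensor(pos_weight))
```

Because the target is simply whether the frozen VLM got the example right, this predictor needs only embeddings and labels, which is what makes the approach suitable for post-hoc settings.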

Authors:Ruixiang Chen, Guolei Sun, Yawei Li, Jie Qin, Luca Benini
Title: HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking
Abstract:
This paper presents enhancements to the SAM2 framework for the video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our training-free, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
English: This paper enhances the SAM2 framework for video object tracking by introducing a hierarchical motion estimation strategy and an optimized memory bank, achieving state-of-the-art performance on benchmarks without requiring additional training.
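The lightweight linear prediction stage amounts to a constant-velocity extrapolation of recent box centers. A minimal sketch, with the selective non-linear refinement omitted:

```python
import numpy as np

def predict_next_center(history, window=3):
    """Constant-velocity prediction of the next box center from recent boxes
    given as (x1, y1, x2, y2) tuples."""
    boxes = np.asarray(history[-window:], dtype=float)
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    if len(centers) < 2:
        return centers[-1]
    velocity = np.diff(centers, axis=0).mean(axis=0)
    return centers[-1] + velocity

print(predict_next_center([(0, 0, 10, 10), (5, 0, 15, 10), (10, 0, 20, 10)]))
```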

Authors:Kuiyuan Sun, Yuxuan Zhang, Jichao Zhang, Jiaming Liu, Wei Wang, Niculae Sebe, Yao Zhao
Title: Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model
Abstract:
While diffusion-based methods have shown impressive capabilities in capturing diverse and complex hairstyles, their ability to generate consistent and high-quality multi-view outputs -- crucial for real-world applications such as digital humans and virtual avatars -- remains underexplored. In this paper, we propose Stable-Hair v2, a novel diffusion-based multi-view hair transfer framework. To the best of our knowledge, this is the first work to leverage multi-view diffusion models for robust, high-fidelity, and view-consistent hair transfer across multiple perspectives. We introduce a comprehensive multi-view training data generation pipeline comprising a diffusion-based Bald Converter, a data-augment inpainting model, and a face-finetuned multi-view diffusion model to generate high-quality triplet data, including bald images, reference hairstyles, and view-aligned source-bald pairs. Our multi-view hair transfer model integrates polar-azimuth embeddings for pose conditioning and temporal attention layers to ensure smooth transitions between views. To optimize this model, we design a novel multi-stage training strategy consisting of pose-controllable latent IdentityNet training, hair extractor training, and temporal attention training. Extensive experiments demonstrate that our method accurately transfers detailed and realistic hairstyles to source subjects while achieving seamless and consistent results across views, significantly outperforming existing methods and establishing a new benchmark in multi-view hair transfer. Code is publicly available at https://github.com/sunkymepro/StableHairV2.
English Summary: Stable-Hair v2 introduces a novel diffusion-based framework that achieves robust, high-fidelity multi-view hair transfer through innovative pose conditioning and temporal attention mechanisms, significantly outperforming existing methods.

Authors:Chunyan Wang, Dong Zhang, Jinhui Tang
Title: Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation
Abstract:
Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source code has been released at: https://github.com/ChunyanWang1/DGKD-WLSS.
English: The proposed DGKD-WLSS framework enhances weakly-supervised semantic segmentation in low-light conditions by combining diffusion-guided knowledge distillation with depth-guided feature fusion, achieving state-of-the-art performance.

Authors:Joelle Hanna, Linus Scheibenreif, Damian Borth
Title: MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models
Abstract:
Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality-aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at https://github.com/HSG-AIML/MAPEX.
English Summary: MAPEX is a remote sensing foundation model that uses a mixture-of-modality experts approach to address the modality mismatch between pre-training data and application needs, enabling efficient task-specific adaptation through modality-aware pruning.
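Modality-aware pruning can be approximated by ranking experts by their average routing mass on tokens of the target modality and keeping only the top few. A sketch under assumed shapes; names like router_logits are illustrative:

```python
import torch

def prune_experts(router_logits, experts, modality_token_idx, keep=4):
    """Average the router's expert distribution over tokens of the target
    modality and retain only the top-`keep` experts for deployment.

    router_logits: (num_tokens, num_experts); experts: list of expert modules.
    """
    scores = router_logits[modality_token_idx].softmax(dim=-1).mean(dim=0)
    keep_idx = scores.topk(keep).indices
    return [experts[i] for i in keep_idx.tolist()], keep_idx
```

The retained subnetwork is then fine-tuned as a small modality-specific model, which is what makes deployment on single-modality tasks cheap.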

Authors:Shuaijin Wan
Title: GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction
Abstract:
Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLPs), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving significant gains in short-term motion prediction. The code is available at https://github.com/inkcat520/GGMotion.git.
English: The proposed GGMotion model improves human motion prediction by grouping joints to better incorporate physical dynamics and kinematics, using a novel graph network with radial fields and interaction modules to enhance spatial-temporal dependencies and physical realism, achieving superior results on standard benchmarks.

Authors:Jiaxu Wan, Xu Wang, Mengwei Xie, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Ding Yuan
Title: Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles
Abstract:
Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, which makes it challenging to utilize online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid-navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, OMA contains 480k roads and 260k lane paths and provides the corresponding metrics to evaluate the performance of the model. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.
English: This paper introduces the Online Map Association (OMA) benchmark, the first to link global standard-definition maps with online high-definition maps for hybrid navigation in autonomous vehicles, and proposes the Map Association Transformer as a baseline method to enhance planning capabilities.

Authors:Ling Zhou, Runtian Yuan, Yi Liu, Yuejie Zhang, Rui Feng, Shang Gao
Title: Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation
Abstract:
Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi-level semantic representation, significantly improving the model's resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code can be found at https://github.com/ZhouL2001/DSANet.
English: DSANet introduces dual semantic-aware modules to enhance ultrasound video segmentation by improving noise robustness through local-global feature integration, achieving superior accuracy and faster inference than existing methods.

Authors:Jingjing Jiang, Chao Ma, Xurui Song, Hanwang Zhang, Jun Luo
Title: Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
Abstract:
Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.github.io/corvid.

Authors:Yongtang Bao, Chengjie Tang, Yuze Wang, Haojie Li
Title: Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections
Abstract:
Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to obtain than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene's lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also design a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.
English: Seg-Wild is an interactive segmentation method based on 3D Gaussian Splatting that addresses inconsistent lighting and transient occlusions in unconstrained photo collections, achieving superior segmentation and reconstruction quality through multi-dimensional feature embeddings and a Spiky 3D Gaussian Cutter.

Authors:Haotian Wang, Aoran Xiao, Xiaoqin Zhang, Meng Yang, Shijian Lu
Title: PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
Abstract:
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
Summary: PacGDC is a label-efficient technique that enhances data diversity for generalizable depth completion by synthesizing pseudo geometries through scene-scale manipulation with depth foundation models, achieving strong performance across diverse benchmarks with minimal annotation effort.
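The data-synthesis idea, one visual scene paired with many pseudo geometries that differ in scene scale, can be illustrated with a global scale-and-shift perturbation of a foundation model's depth output. A sketch under that simplification (the real pipeline also manipulates local object scales and mixes several foundation models; the input depth here is a stand-in):

    import numpy as np

    def synthesize_pseudo_depths(depth, n_variants=4, seed=0):
        """Create pseudo geometries for one scene by re-scaling its depth map.

        depth: (H, W) pseudo depth from any depth foundation model.
        A global scale/shift keeps the 2D-to-3D projection consistent per variant."""
        rng = np.random.default_rng(seed)
        variants = []
        for _ in range(n_variants):
            scale = rng.uniform(0.5, 2.0)  # illustrative range
            shift = rng.uniform(0.0, 1.0)
            variants.append(scale * depth + shift)
        return variants

    pseudo = synthesize_pseudo_depths(np.ones((480, 640), dtype=np.float32))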

Authors:Sherry X. Chen, Yi Wei, Luowei Zhou, Suren Kumar
Title: ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
Abstract:
Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model's average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%). Our code and models are available at https://github.com/SherryXTChen/ADIEE.git.
Summary: ADIEE is an automated dataset-creation method that trains a scoring model for instruction-guided image editing evaluation, outperforming existing open-source and proprietary models with significant gains in human-rating correlation and pairwise comparison accuracy.
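Decoding a numeric score from a custom token amounts to regressing from that token's final hidden state. A schematic head under that reading (the hidden size, score range, and token handling are assumptions; ADIEE's actual decoding may differ):

    import torch
    import torch.nn as nn

    class ScoreHead(nn.Module):
        """Map the hidden state of a dedicated score token to a bounded scalar."""
        def __init__(self, hidden=4096, lo=0.0, hi=10.0):
            super().__init__()
            self.proj = nn.Linear(hidden, 1)
            self.lo, self.hi = lo, hi

        def forward(self, token_state):  # (B, hidden) state of the score token
            x = torch.sigmoid(self.proj(token_state)).squeeze(-1)
            return self.lo + (self.hi - self.lo) * x

    head = ScoreHead()
    score = head(torch.randn(2, 4096))   # one edit-quality score per sample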

Authors:Heet Nitinkumar Dalsania
Title: Label-Efficient Chest X-ray Diagnosis via Partial CLIP Adaptation
Abstract:
Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP's pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20% improvement in mean AUC score compared to the zero-shot baseline. The key aspect of this work is the attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally, this research is intended for academic and experimental purposes only and has not been peer reviewed yet. All code is found at https://github.com/heet007-code/CLIP-disease-xray.
Summary: This paper introduces a label-efficient strategy that adapts a pre-trained CLIP model for chest X-ray diagnosis, achieving over 20% improvement in mean AUC through few-shot learning to address data scarcity in medical imaging.
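The partial-adaptation recipe, freeze most of CLIP's visual encoder and train only its last blocks, is straightforward in PyTorch. A sketch assuming the openai clip package and its ViT-B/32 layer naming (how many blocks to unfreeze is a tunable choice, not the paper's reported setting):

    import clip
    import torch

    model, preprocess = clip.load("ViT-B/32", device="cpu")

    for p in model.parameters():          # freeze everything first
        p.requires_grad = False
    for block in model.visual.transformer.resblocks[-2:]:  # unfreeze last 2 blocks
        for p in block.parameters():
            p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)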

Authors:Priyank Pathak, Yogesh S. Rawat
Title: Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement
Abstract:
Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color - specifically foreground and background colors - as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel self-attention mechanism that prevents information leakage between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline's Top-1 accuracy by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and by 1.0% on CCVID and 2.5% on MeVID for video-based ReID, without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. Github: https://github.com/ppriyank/ICCV-CSCI-Person-ReID.
Summary: CSCI exploits color information from raw images to mitigate appearance bias in clothes-changing re-identification without additional annotations, achieving significant performance improvements across multiple datasets.

Authors:Renyang Liu, Guanlin Li, Tianwei Zhang, See-Kiong Ng
Title: Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning
Abstract:
Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at https://github.com/ryliu68/RECALL.
Summary: Recall is an adversarial framework that exposes vulnerabilities in current image-generation-model unlearning methods, using adversarial image prompts guided by a single reference image to compromise their robustness efficiently.

Authors:Cristina Mata, Kanchana Ranasinghe, Michael S. Ryoo
Title: CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
Abstract:
Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks, we show that a model trained using CoPT achieves new state-of-the-art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.
Summary: CoPT is a covariance-based pixel-text loss that leverages domain-agnostic text embeddings from LLM-generated domain descriptions to achieve state-of-the-art performance in unsupervised domain adaptation for semantic segmentation.
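The LLM Domain Template step reduces to encoding LLM-written source- and target-domain descriptions with a frozen CLIP text encoder and combining them per class. A sketch of one plausible combination, simple averaging of normalized embeddings (the prompts are made-up placeholders, and the paper's exact template and fusion may differ):

    import clip
    import torch

    model, _ = clip.load("ViT-B/32", device="cpu")

    def domain_agnostic_embedding(class_name, src_desc, tgt_desc):
        prompts = [f"a photo of a {class_name}, {d}" for d in (src_desc, tgt_desc)]
        tokens = clip.tokenize(prompts)
        with torch.no_grad():
            emb = model.encode_text(tokens).float()  # (2, 512)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return emb.mean(dim=0)                       # combined, domain-agnostic

    text_emb = domain_agnostic_embedding(
        "car", "a synthetic street scene", "a real dashcam photo")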

Authors:Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan
Title: Multi-level Mixture of Experts for Multimodal Entity Linking
Abstract:
Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention's textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention's textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE's code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
Summary: The Multi-level Mixture of Experts (MMoE) model addresses mention ambiguity and dynamic modality selection in Multimodal Entity Linking by leveraging large language models and adaptive feature selection, outperforming existing state-of-the-art methods.
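The intra- and inter-level modules both build on a switch mixture-of-experts, in which a router sends each input to a single expert. A generic top-1 switch layer in PyTorch (dimensions and expert count are illustrative; the paper applies this mechanism at multiple feature levels):

    import torch
    import torch.nn as nn

    class SwitchMoE(nn.Module):
        def __init__(self, dim=256, n_experts=4):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
                for _ in range(n_experts)])

        def forward(self, x):                        # x: (B, dim)
            gates = self.router(x).softmax(dim=-1)   # (B, n_experts)
            top_gate, top_idx = gates.max(dim=-1)    # top-1 "switch" routing
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                sel = top_idx == i
                if sel.any():                        # route each input to its expert
                    out[sel] = top_gate[sel].unsqueeze(1) * expert(x[sel])
            return out

    y = SwitchMoE()(torch.randn(8, 256))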

Authors:Vatsal Agarwal, Matthew Gwilliam, Gefen Kohavi, Eshan Verma, Daniel Ulbricht, Abhinav Shrivastava
Title: Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor
Abstract:
Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it can capture coarse global information, it often misses fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find diffusion features are both rich in semantics and can encode strong image-text alignment. Moreover, we find that we can leverage text conditioning to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, where the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning. Our project page can be found at https://vatsalag99.github.io/mustafar/.

Authors:Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao
Title: Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Abstract:
Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.
Summary: The Vision-Language-Vision (VLV) auto-encoder is a cost-efficient framework that leverages pretrained components to build a state-of-the-art captioner from primarily single-modal image data at a training cost under $1,000 USD.

Authors:Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang
Title: Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Abstract:
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era of zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
Summary: This work introduces MotionMillion, the largest human motion dataset with over 2 million sequences, and the MotionMillion-Eval benchmark, achieving strong zero-shot text-to-motion generalization with a 7B-parameter model.

Authors:Ziyue Liu, Federico Girella, Yiming Wang, Davide Talon
Title: Evaluating Attribute Confusion in Fashion Text-to-Image Generation
Abstract:
Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, involving complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated to the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and mis-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.
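The reflection/leakage logic can be stated numerically: an attribute should score high when a VQA model affirms it on the localized target entity and denies it on the other entities. A toy aggregation under that reading (the paper's exact combination rule may differ; the two inputs stand for VQA "yes" probabilities on target and non-target crops):

    def l_vqascore_toy(p_reflect, p_leak):
        """Toy localized score for one entity-attribute pair.

        p_reflect: P(yes) probing the attribute on the target entity's crop.
        p_leak:    P(yes) probing the same attribute on other entities' crops."""
        return p_reflect * (1.0 - p_leak)

    print(l_vqascore_toy(0.9, 0.1))  # 0.81: attribute on the right entity
    print(l_vqascore_toy(0.9, 0.8))  # 0.18: attribute confusion (leakage)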

Authors:Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen, Quan Wang
Title: MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
Abstract:
Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and text, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
Summary: MST-Distill is a cross-modal knowledge distillation framework that uses a mixture of specialized teachers with an instance-level routing network and a masking module to address distillation path selection and knowledge drift, significantly outperforming existing methods on diverse multimodal datasets.
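The instance-level routing network can be pictured as a small gate that weights each teacher's soft labels per sample before the distillation loss. A compact sketch (teacher logits are precomputed here; the gate architecture and temperature are illustrative assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RoutedDistillLoss(nn.Module):
        def __init__(self, feat_dim=128, n_teachers=3, tau=4.0):
            super().__init__()
            self.gate = nn.Linear(feat_dim, n_teachers)  # instance-level router
            self.tau = tau

        def forward(self, student_logits, teacher_logits, student_feat):
            # student_logits: (B, C); teacher_logits: (T, B, C); student_feat: (B, D)
            w = self.gate(student_feat).softmax(dim=-1)           # (B, T) routing weights
            t_soft = (teacher_logits / self.tau).softmax(dim=-1)  # per-teacher soft labels
            mixed = torch.einsum("bt,tbc->bc", w, t_soft)         # routed teacher target
            s_log = F.log_softmax(student_logits / self.tau, dim=-1)
            return F.kl_div(s_log, mixed, reduction="batchmean") * self.tau ** 2

    loss = RoutedDistillLoss()(torch.randn(8, 10), torch.randn(3, 8, 10),
                               torch.randn(8, 128))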

Authors:Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Kunyu Peng, Jiaming Zhang, Kailun Yang
Title: Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting
Abstract:
Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot achieve high-quality, controllable panoramic generation. In this paper, we propose Percep360, the first panoramic generation method for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird's Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/Bryant-Teng/Percep360.
Summary: Percep360 is the first panoramic generation method for autonomous driving, achieving coherent and controllable data synthesis through a Local Scenes Diffusion Method and a Probabilistic Prompting Method, improving image quality and downstream perception.

Authors:Yixin Zhao, Yuyi Zhang, Lianwen Jin
Title: MCCD: A Multi-Attribute Chinese Calligraphy Character Dataset Annotated with Script Styles, Dynasties, and Calligraphers
Abstract:
Research on the attribute information of calligraphy, such as styles, dynasties, and calligraphers, holds significant cultural and historical value. However, the styles of Chinese calligraphy characters have evolved dramatically through different dynasties and the unique touches of calligraphers, making it highly challenging to accurately recognize these characters and their attributes. Furthermore, existing calligraphic datasets are extremely scarce, and most provide only character-level annotations without additional attribute information. This limitation has significantly hindered in-depth study of Chinese calligraphy. To fill this gap, we present a novel Multi-Attribute Chinese Calligraphy Character Dataset (MCCD). The dataset encompasses 7,765 categories with a total of 329,715 isolated image samples of Chinese calligraphy characters, and three additional subsets extracted based on attribute labels for script style (10 types), dynasty (15 periods), and calligrapher (142 individuals). The rich multi-attribute annotations render MCCD well-suited to diverse research tasks, including calligraphic character recognition, writer identification, and evolutionary studies of Chinese characters. We establish benchmark performance through single-task and multi-task recognition experiments across MCCD and all of its subsets. The experimental results demonstrate that the complexity of the stroke structure of calligraphic characters and the interplay between their different attributes lead to a substantial increase in the difficulty of accurate recognition. MCCD not only fills a void in the availability of detailed calligraphy datasets but also provides valuable resources for advancing research in Chinese calligraphy and fostering progress in multiple fields. The dataset is available at https://github.com/SCUT-DLVCLab/MCCD.
Summary: The Multi-Attribute Chinese Calligraphy Character Dataset (MCCD) addresses the scarcity of annotated calligraphy data with 329,715 character images labeled with script style, dynasty, and calligrapher attributes, supporting recognition and historical research.

Authors:Xuesong Li, Nassir Navab, Zhongliang Jiang
Title: Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data
Abstract:
Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Project page: https://noseefood.github.io/us-speckle2self/
Summary: Speckle2Self is a self-supervised algorithm that reduces ultrasound speckle from single noisy observations, using multi-scale perturbation to separate tissue-dependent speckle from the shared anatomical structure.
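The modeling assumption, shared anatomy as a low-rank signal across multi-scale perturbations and speckle as the residual, can be illustrated with a plain SVD split over a stack of perturbed views (a didactic stand-in for the learned network, not the paper's training procedure):

    import numpy as np

    def lowrank_split(views, rank=1):
        """views: (K, H*W) stack of multi-scale-perturbed versions of one image.
        Returns the shared low-rank component and the per-view residual."""
        u, s, vt = np.linalg.svd(views, full_matrices=False)
        shared = (u[:, :rank] * s[:rank]) @ vt[:rank]  # anatomy shared across scales
        residual = views - shared                      # speckle-like, scale-dependent part
        return shared, residual

    stack = np.random.rand(8, 64 * 64).astype(np.float32)  # 8 perturbed views
    anatomy, speckle = lowrank_split(stack, rank=1)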

Authors:Xu Yang, Shaoli Huang, Shenbo Xie, Xuelin Chen, Yifei Liu, Changxing Ding
Title: Democratizing High-Fidelity Co-Speech Gesture Video Generation
Abstract:
Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker's reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts. Code, models, and CSG-405 are publicly released at https://mpi-lab.github.io/Democratizing-CSG/

Authors:Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick
Title: GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction
Abstract:
Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors and ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.
Summary: GreenHyperSpectra is a multi-source, cross-sensor hyperspectral pretraining dataset enabling label-efficient plant trait prediction with semi- and self-supervised learning, outperforming supervised baselines while addressing domain shifts and label scarcity.

Authors:Antonella Barisic Kulas, Andreja Jurasovic, Stjepan Bogdan
Title: Unlocking Thermal Aerial Imaging: Synthetic Enhancement of UAV Datasets
Abstract:
Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. In evaluating these datasets on the object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: https://github.com/larics/thermal_aerial_synthetic.
Summary: This study introduces a procedural pipeline for generating synthetic thermal aerial images, enriching existing datasets with new object classes and showing that thermal detectors outperform visible-light-trained counterparts.

Authors:Yan Hon Michael Chung, Donghyeok Choi
Title: Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
Abstract:
Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at https://github.com/mic7ch1/ManchuAI-OCR.
Summary: This study develops high-performing OCR systems for the endangered Manchu language by fine-tuning open-source vision-language models on synthetic data, achieving strong accuracy on both synthetic and real-world handwritten documents where a CRNN baseline degrades severely.
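The parameter-efficient training referred to here is the standard adapter recipe; with Hugging Face's peft library it reduces to wrapping the VLM in a LoRA config. A sketch (the auto-class, checkpoint id, target modules, and ranks are placeholders and common defaults, not the paper's reported settings):

    from transformers import AutoModelForVision2Seq
    from peft import LoraConfig, get_peft_model

    # placeholder loading; match the class and checkpoint to the chosen VLM
    model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()        # a small fraction of the full weights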

Authors:Daojie Peng, Jiahang Cao, Qiang Zhang, Jun Ma
Title: LOVON: Legged Open-Vocabulary Object Navigator
Abstract:
Object navigation in open-world environments remains a formidable and pervasive challenge for robotic systems, particularly when it comes to executing long-horizon tasks that require both open-world object detection and high-level task planning. Traditional methods often struggle to integrate these components effectively, and this limits their capability to deal with complex, long-range navigation missions. In this paper, we propose LOVON, a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. We also develop a functional execution logic for the robot that guarantees LOVON's capabilities in autonomous navigation, task adaptation, and robust task completion. Extensive evaluations demonstrate the successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. Furthermore, real-world experiments across different legged robots (Unitree Go2, B2, and H1-2) showcase the compatibility and appealing plug-and-play feature of LOVON.
Summary: LOVON integrates large language models with open-vocabulary visual detection for long-range object navigation on legged robots in dynamic environments, handling visual jittering and temporary target loss with dedicated solutions such as Laplacian Variance Filtering.
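The Laplacian Variance Filtering used for visual stabilization rests on a classic sharpness statistic: the variance of the Laplacian collapses on blurred or jittered frames, so such frames can be dropped before detection. A minimal OpenCV version (the threshold is a tunable assumption):

    import cv2

    def is_frame_stable(frame_bgr, threshold=100.0):
        """Reject blurry/jittered frames via the variance of the Laplacian."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var() > threshold

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    if ok and is_frame_stable(frame):
        pass  # hand the frame to the open-vocabulary detector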

Authors:Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Title: MADPOT: Medical Anomaly Detection with CLIP Adaptation and Partial Optimal Transport
Abstract:
Medical anomaly detection (AD) is challenging due to diverse imaging modalities, anatomical variations, and limited labeled data. We propose a novel approach combining visual adapters and prompt learning with Partial Optimal Transport (POT) and contrastive learning (CL) to improve CLIP's adaptability to medical images, particularly for AD. Unlike standard prompt learning, which often yields a single representation, our method employs multiple prompts aligned with local features via POT to capture subtle abnormalities. CL further enforces intra-class cohesion and inter-class separation. Our method achieves state-of-the-art results in few-shot, zero-shot, and cross-dataset scenarios without synthetic data or memory banks. The code is available at https://github.com/mahshid1998/MADPOT.
Summary: MADPOT combines visual adapters and prompt learning with Partial Optimal Transport and contrastive learning to improve CLIP's adaptability to medical anomaly detection, achieving state-of-the-art few-shot, zero-shot, and cross-dataset results without synthetic data or memory banks.

Authors:Miaojing Shi, Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Li Li
Title: Text-promptable Object Counting via Quantity Awareness Enhancement
Abstract:
Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This, however, is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model's quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate effective knowledge communication and aggregation between the Transformer and CNN streams. Finally, a cross-stream quantity ranking loss is proposed to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model's strong generalizability for zero-shot class-agnostic counting. Code is available at https://github.com/viscom-tongji/QUANet.
Summary: QUANet introduces quantity-oriented text prompts and a dual-stream adaptive counting decoder with Transformer-to-CNN enhancement adapters to strengthen quantity awareness, showing strong zero-shot class-agnostic counting across multiple benchmarks.

Authors:Boyuan Tian, Qizhe Gao, Siran Xianyu, Xiaotong Cui, Minjia Zhang
Title: FlexGaussian: Flexible and Cost-Effective Training-Free Compression for 3D Gaussian Splatting
Abstract:
3D Gaussian splatting has become a prominent technique for representing and rendering complex 3D scenes, due to its high fidelity and speed advantages. However, the growing demand for large-scale models calls for effective compression to reduce memory and computation costs, especially on mobile and edge devices with limited resources. Existing compression methods effectively reduce 3D Gaussian parameters but often require extensive retraining or fine-tuning, lacking flexibility under varying compression constraints. In this paper, we introduce FlexGaussian, a flexible and cost-effective method that combines mixed-precision quantization with attribute-discriminative pruning for training-free 3D Gaussian compression. FlexGaussian eliminates the need for retraining and adapts easily to diverse compression targets. Evaluation results show that FlexGaussian achieves up to 96.4% compression while maintaining high rendering quality (<1 dB drop in PSNR), and is deployable on mobile devices. FlexGaussian delivers high compression ratios within seconds, being 1.7-2.1x faster than state-of-the-art training-free methods and 10-100x faster than training-involved approaches. The code is being prepared and will be released soon at: https://github.com/Supercomputing-System-AI-Lab/FlexGaussian
Summary: FlexGaussian is a training-free compression method for 3D Gaussian splatting that combines mixed-precision quantization with attribute-discriminative pruning, achieving up to 96.4% compression with under 1 dB PSNR loss and fast, mobile-friendly deployment.
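Training-free compression of a Gaussian point set comes down to per-attribute codecs: keep geometry-critical attributes at higher precision and push bulky appearance attributes, such as spherical-harmonic coefficients, to 8 bits. A minimal uniform-quantization sketch (the bit allocation is illustrative, not FlexGaussian's actual schedule):

    import numpy as np

    def quantize_uint8(x):
        """Uniform affine quantization of one attribute tensor to uint8."""
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255.0
        if scale == 0.0:
            scale = 1.0
        return np.round((x - lo) / scale).astype(np.uint8), lo, scale

    def dequantize(q, lo, scale):
        return q.astype(np.float32) * scale + lo

    sh = np.random.randn(100_000, 48).astype(np.float32)  # SH coefficients
    q, lo, scale = quantize_uint8(sh)                     # 4x smaller than fp32
    sh_hat = dequantize(q, lo, scale)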

Authors:Yifan Yang, Peili Song, Enfan Lan, Dong Liu, Jingtai Liu
Title: MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning
Abstract:
Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching, and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Codes will be released at https://github.com/yangyifanYYF/MK-Pose.
Summary: MK-Pose is a multimodal keypoint-learning framework that fuses RGB images, point clouds, and category-level text descriptions for category-level object pose estimation, outperforming state-of-the-art methods on benchmark datasets without shape priors.

Authors:Hongjie Wu, Mingqin Zhang, Linchao He, Ji-Zhe Zhou, Jiancheng Lv
Title: Enhancing Diffusion Model Stability for Image Restoration via Gradient Management
Abstract:
Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at https://github.com/74587887/SPGD.
Summary: SPGD is a gradient-management technique that stabilizes diffusion-based image restoration by resolving prior-likelihood gradient conflicts with a progressive warm-up and smoothing likelihood-gradient fluctuations with adaptive directional momentum, achieving state-of-the-art results across restoration tasks.
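Both SPGD components act directly on the likelihood-guidance gradient: a warm-up schedule ramps its weight in over the early sampling steps, and a momentum buffer smooths its direction across steps. A schematic update (the schedule, coefficients, and the stand-in gradient are assumptions for illustration):

    import torch

    def spgd_guidance(lik_grad, state, step, n_steps, warmup_frac=0.3, beta=0.9):
        """Manage the likelihood gradient for one reverse-diffusion step."""
        warm = min(1.0, step / max(1, int(warmup_frac * n_steps)))  # progressive warm-up
        m = state.get("m")
        m = lik_grad if m is None else beta * m + (1 - beta) * lik_grad  # directional momentum
        state["m"] = m
        return warm * m  # smoothed guidance, added to the denoising (prior) update

    state = {}
    for t in range(50):
        g = torch.randn(3, 64, 64)  # stand-in for a real likelihood gradient
        managed = spgd_guidance(g, state, t, 50)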

Authors:Naoya Sogi, Takashi Shibata, Makoto Terao, Masanori Suganuma, Takayuki Okatani
Title: MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval
Abstract:
Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the practical utility of retrieved results. Conventional methods focus solely on increasing the diversity of image appearances. However, the appropriate diversity metric and its desired value vary depending on the application, which limits the applicability of RD. This paper proposes a novel task called CDR-CA (Contextual Diversity Refinement of Composite Attributes). CDR-CA aims to refine the diversities of multiple attributes according to the application's context. To address this task, we propose Multi-Source DPPs, a simple yet strong baseline that extends the Determinantal Point Process (DPP) to multiple sources. We model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation. We also introduce Tangent Normalization to reflect contexts. Extensive experiments demonstrate the effectiveness of the proposed method. Our code is publicly available at https://github.com/NEC-N-SOGI/msdpp.
Summary: This paper introduces CDR-CA, a task for refining the diversity of composite attributes in text-to-image retrieval according to application context, and proposes Multi-Source DPPs with Tangent Normalization as a simple yet strong baseline, validated through extensive experiments.
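Once the multi-source similarities are folded into one DPP kernel L, diversified retrieval is the usual greedy MAP selection: repeatedly add the item with the largest log-determinant gain. A naive reference implementation (a production version would use the incremental Cholesky speed-up; the kernel here is a random stand-in for the unified similarity matrix):

    import numpy as np

    def greedy_dpp(L, k):
        """Greedily pick k items approximately maximizing det(L[S, S])."""
        n, selected = L.shape[0], []
        for _ in range(k):
            best, best_gain = None, -np.inf
            for i in set(range(n)) - set(selected):
                s = selected + [i]
                sign, logdet = np.linalg.slogdet(L[np.ix_(s, s)])
                if sign > 0 and logdet > best_gain:
                    best, best_gain = i, logdet
            if best is None:
                break
            selected.append(best)
        return selected

    x = np.random.rand(50, 16)
    L = x @ x.T + 1e-6 * np.eye(50)  # PSD kernel standing in for MS-DPP's matrix
    print(greedy_dpp(L, 5))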

Authors:Chengkun Li, Yuqi Tong, Kai Chen, Zhenya Yang, Ruiyang Li, Shi Qiu, Jason Ying-Kuen Chan, Pheng-Ann Heng, Qi Dou
Title: ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data
Abstract:
The visualization of volumetric medical data is crucial for enhancing diagnostic accuracy and improving surgical planning and education. Cinematic rendering techniques significantly enrich this process by providing high-quality visualizations that convey intricate anatomical details, thereby facilitating better understanding and decision-making in medical contexts. However, high computational cost and low rendering speed limit interactive visualization in practical applications. In this paper, we introduce ClipGS, an innovative Gaussian splatting framework with clipping-plane support for interactive cinematic visualization of volumetric medical data. To address the challenges posed by dynamic interactions, we propose a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane. Besides, we design an adaptive adjustment model to dynamically adjust the deformation of Gaussians and refine the rendering performance. We validate our method on five volumetric medical datasets (including CT and anatomical slice data), reaching an average rendering quality of 36.635 PSNR at 156 FPS with a 16.1 MB model size, outperforming state-of-the-art methods in rendering quality and efficiency.

Authors:Qing Zhang, Guoquan Pei, Yan Wang
Title: Omni-Fusion of Spatial and Spectral for Hyperspectral Image Segmentation
Abstract:
Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to their inherent high dimensionality and spectral redundancy. To address these challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named Omni-Fuse. We introduce abundant cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection that selects the most spectral-related spatial feature as the query, and a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial query. Despite its numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach significantly improves segmentation performance compared with state-of-the-art methods, with an improvement of over 5.73% in DSC. Code available at: https://github.com/DeepMed-Lab-ECNU/Omni-Fuse.
Summary: Omni-Fuse is a spatial-spectral omni-fusion network for medical hyperspectral image segmentation that integrates cross-dimensional feature fusion with spectral-guided spatial query selection, improving DSC by over 5.73% while remaining computationally efficient.

Authors:Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing, Fan Chao
Title: Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
Abstract:
Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks including Charades, Breakfast and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.
Summary: Cross-Modal Dual-Causal Learning (CMDCL) applies textual and visual causal interventions to remove cross-modal biases and visual confounders in long-term action recognition, proving effective on Charades, Breakfast, and COIN.

Authors:Qianyu Zhang, Bolun Zheng, Lingyu Zhu, Hangjia Pan, Zunjie Zhu, Zongpeng Li, Shiqi Wang
Title: Capturing Stable HDR Videos Using a Dual-Camera System
Abstract:
High Dynamic Range (HDR) video acquisition using the alternating exposure (AE) paradigm has garnered significant attention due to its cost-effectiveness with a single consumer camera. However, despite progress driven by deep neural networks, these methods remain prone to temporal flicker in real-world applications due to inter-frame exposure inconsistencies. To address this challenge while maintaining the cost-effectiveness of the AE paradigm, we propose a novel learning-based HDR video generation solution. Specifically, we propose a dual-stream HDR video generation paradigm that decouples temporal luminance anchoring from exposure-variant detail reconstruction, overcoming the inherent limitations of the AE paradigm. To support this, we design an asynchronous dual-camera system (DCS), which enables independent exposure control across two cameras, eliminating the need for synchronization typically required in traditional multi-camera setups. Furthermore, an exposure-adaptive fusion network (EAFNet) is formulated for the DCS system. EAFNet integrates a pre-alignment subnetwork that aligns features across varying exposures, ensuring robust feature extraction for subsequent fusion, an asymmetric cross-feature fusion subnetwork that emphasizes reference-based attention to effectively merge these features across exposures, and a reconstruction subnetwork to mitigate ghosting artifacts and preserve fine details. Extensive experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance across various datasets, showing the remarkable potential of our solution in HDR video reconstruction. The codes and data captured by DCS will be available at https://zqqqyu.github.io/DCS-HDR/.

Authors:Yang Chen, Yueqi Duan, Haowen Sun, Jiwen Lu, Yap-Peng Tan
Title: Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning
Abstract:
This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore per-point ambiguities and the less discriminative features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method incorporating contrastive learning into an ambiguity estimation framework, tailored to adaptive objectives for individual points based on their ambiguity levels. As a result, our method promotes model training, ensuring the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. Because ambiguities are formulated from position discrepancies across labels, inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from the generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to make ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on the 3D indoor scene datasets S3DIS and ScanNet demonstrate the effectiveness of the proposed method. Code is available at https://github.com/YangChenApril/AMContrast3D.
Summary: AMContrast3D is an adaptive margin contrastive learning method for 3D point cloud segmentation that adjusts per-point objectives by ambiguity level; AMContrast3D++ adds a parallel ambiguity prediction branch with masked refinement to further boost performance and robustness.
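The adaptive objective can be read as a contrastive loss whose margin shrinks with a point's ambiguity, so confident points are pushed hard while ambiguous ones are penalized gently. A simplified per-point form (the ambiguity-to-margin mapping is an illustrative choice, not the paper's exact function):

    import torch
    import torch.nn.functional as F

    def adaptive_margin_loss(feat, pos, neg, ambiguity, m_max=0.4):
        """feat/pos/neg: (N, D) point, positive-class, and negative-class embeddings.
        ambiguity: (N,) in [0, 1]; higher = less reliable label = smaller margin."""
        margin = m_max * (1.0 - ambiguity)             # per-point adaptive margin
        s_pos = F.cosine_similarity(feat, pos, dim=1)
        s_neg = F.cosine_similarity(feat, neg, dim=1)
        return F.relu(s_neg - s_pos + margin).mean()

    n, d = 1024, 32
    loss = adaptive_margin_loss(torch.randn(n, d), torch.randn(n, d),
                                torch.randn(n, d), torch.rand(n))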

Authors:Qibiao Wu, Yagang Wang, Qian Zhang
Title: Airway Segmentation Network for Enhanced Tubular Feature Extraction
Abstract:
Manual annotation of airway regions in computed tomography images is a time-consuming and expertise-dependent task. Automatic airway segmentation is therefore a prerequisite for enabling rapid bronchoscopic navigation and the clinical deployment of bronchoscopic robotic systems. Although convolutional neural network methods have gained considerable attention in airway segmentation, the unique tree-like structure of airways poses challenges for conventional and deformable convolutions, which often fail to focus on fine airway structures, leading to missed segments and discontinuities. To address this issue, this study proposes a novel tubular feature extraction network, named TfeNet. TfeNet introduces a direction-aware convolution operation that first applies spatial rotation transformations to adjust the sampling positions of linear convolution kernels. The deformed kernels are then represented as line segments or polylines in 3D space. Furthermore, a tubular feature fusion module (TFFM) is designed based on asymmetric convolution and residual connection strategies, enhancing the network's focus on subtle airway structures. Extensive experiments conducted on one public dataset and two datasets used in airway segmentation challenges demonstrate that the proposed TfeNet achieves more accurate and continuous airway structure predictions than existing methods. In particular, TfeNet achieves the highest overall score of 94.95% on the current largest airway segmentation dataset, Airway Tree Modeling (ATM22), and demonstrates advanced performance on the lung fibrosis dataset (AIIB23). The code is available at https://github.com/QibiaoWu/TfeNet.
English Summary: This study introduces TfeNet, a novel tubular feature extraction network that enhances airway segmentation in CT images through direction-aware convolution and a tubular feature fusion module, achieving superior accuracy and continuity compared to existing methods.
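The abstract describes the TFFM only as a combination of asymmetric convolutions and residual connections; the following PyTorch sketch shows one plausible 3D realization under that description, with the channel sizes and the exact branch layout being assumptions rather than the released design.

```python
import torch
import torch.nn as nn

class AsymTubularBlock(nn.Module):
    """Illustrative 3D block: three axis-aligned asymmetric convs + residual."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch_z = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.branch_y = nn.Conv3d(channels, channels, kernel_size=(1, 3, 1), padding=(0, 1, 0))
        self.branch_x = nn.Conv3d(channels, channels, kernel_size=(1, 1, 3), padding=(0, 0, 1))
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Each branch responds to thin structures along one axis; summing the
        # branches emphasizes line-like (tubular) features, and the residual
        # connection preserves the surrounding context.
        out = self.branch_z(x) + self.branch_y(x) + self.branch_x(x)
        return self.act(self.bn(out) + x)

# toy usage
feat = torch.randn(1, 16, 32, 64, 64)   # (B, C, D, H, W) CT feature volume
print(AsymTubularBlock(16)(feat).shape)
```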

Authors:Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun
Title: Token Bottleneck: One Token to Remember Dynamics
Abstract:
Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with a few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments on diverse sequential tasks, including video label propagation and robot manipulation in simulated environments, demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
English: This paper introduces Token Bottleneck (ToBo), a self-supervised learning method that compresses scenes into bottleneck tokens and predicts subsequent scenes using minimal patches, enabling effective sequential scene understanding across various tasks and real-world applications.
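A toy PyTorch sketch of the squeeze-and-expand pipeline follows, assuming pre-patchified token sequences; the layer counts, dimensions, and hint-insertion mechanics are illustrative stand-ins, not the released architecture.

```python
import torch
import torch.nn as nn

class ToyTokenBottleneck(nn.Module):
    """Sketch of the squeeze/expand idea: one bottleneck token summarizes the
    reference frame, and the decoder must reconstruct target-frame patch
    features from that token plus a few visible 'hint' patches."""
    def __init__(self, dim=192, n_patches=196):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)
        self.n_patches = n_patches

    def forward(self, ref_tokens, hint_tokens, hint_idx):
        B = ref_tokens.size(0)
        # Squeeze: encode reference patches together with the bottleneck token.
        z = self.encoder(torch.cat([self.bottleneck.expand(B, -1, -1), ref_tokens], 1))
        scene_token = z[:, :1]                                  # (B, 1, D) summary
        # Expand: unknown target patches start as mask tokens; hints fill slots.
        queries = self.mask_token.expand(B, self.n_patches, -1).clone()
        idx = hint_idx[..., None].expand(-1, -1, queries.size(-1))
        queries.scatter_(1, idx, hint_tokens)
        pred = self.decoder(torch.cat([scene_token, queries], 1))[:, 1:]
        return self.head(pred)                                  # predicted patch feats

model = ToyTokenBottleneck()
out = model(torch.randn(2, 196, 192), torch.randn(2, 8, 192),
            torch.randint(0, 196, (2, 8)))
print(out.shape)   # (2, 196, 192)
```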

Authors:Mingjin Zeng, Nan Ouyang, Wenkang Wan, Lei Ao, Qing Cai, Kai Sheng
Title: ILNet: Trajectory Prediction with Inverse Learning Attention for Enhancing Intention Capture
Abstract:
Trajectory prediction for multi-agent interaction scenarios is a crucial challenge. Most advanced methods model agent interactions by efficiently factorized attention based on the temporal and agent axes. However, this static and forward modeling lacks explicit interactive spatio-temporal coordination, capturing only obvious and immediate behavioral intentions. Alternatively, the modern trajectory prediction framework refines successive predictions by a fixed-anchor selection strategy, which is difficult to adapt to different future environments. It is acknowledged that human drivers dynamically adjust initial driving decisions based on further assumptions about the intentions of surrounding vehicles. Motivated by human driving behaviors, this paper proposes ILNet, a multi-agent trajectory prediction method with Inverse Learning (IL) attention and a Dynamic Anchor Selection (DAS) module. IL attention employs an inverse learning paradigm to model interactions at neighboring moments, introducing proposed intentions to dynamically encode the spatio-temporal coordination of interactions, thereby enhancing the model's ability to capture complex interaction patterns. Then, the learnable DAS module is proposed to extract multiple trajectory change keypoints as anchors in parallel with almost no increase in parameters. Experimental results show that ILNet achieves state-of-the-art performance on the INTERACTION and Argoverse motion forecasting datasets. In particular, in challenging interaction scenarios, ILNet achieves higher accuracy and more multimodal trajectory distributions with fewer parameters. Our codes are available at https://github.com/mjZeng11/ILNet.
English Summary: The paper introduces ILNet, a multi-agent trajectory prediction method that uses Inverse Learning attention and a Dynamic Anchor Selection module to better model spatio-temporal interactions and adapt to different environments, achieving state-of-the-art performance with fewer parameters.

Authors:Weiran Li, Yeqiang Liu, Qiannan Guo, Yijie Wei, Hwa Liang Leo, Zhenbo Li
Title: When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking
Abstract:
Multiple object tracking (MOT) technology has made significant progress in terrestrial applications, but underwater tracking scenarios remain underexplored despite their importance to marine ecology and aquaculture. In this paper, we present Multiple Fish Tracking Dataset 2025 (MFT25), a comprehensive dataset specifically designed for underwater multiple fish tracking, featuring 15 diverse video sequences with 408,578 meticulously annotated bounding boxes across 48,066 frames. Our dataset captures various underwater environments, fish species, and challenging conditions including occlusions, similar appearances, and erratic motion patterns. Additionally, we introduce Scale-aware and Unscented Tracker (SU-T), a specialized tracking framework featuring an Unscented Kalman Filter (UKF) optimized for non-linear swimming patterns of fish and a novel Fish-Intersection-over-Union (FishIoU) matching that accounts for the unique morphological characteristics of aquatic species. Extensive experiments demonstrate that our SU-T baseline achieves state-of-the-art performance on MFT25, with 34.1 HOTA and 44.6 IDF1, while revealing fundamental differences between fish tracking and terrestrial object tracking scenarios. The dataset and codes are released at https://vranlee.github.io/SU-T/.
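The abstract pairs an Unscented Kalman Filter with a fish-specific IoU for matching; below is a minimal UKF sketch using the filterpy library under an assumed turning-motion model. The state layout, noise values, and motion function are assumptions, and FishIoU itself is not reproduced here.

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

def fx(x, dt):
    # Assumed motion model: position advances along a gently turning heading,
    # which suits erratic swimming better than a purely linear model.
    px, py, v, theta = x
    return np.array([px + v * dt * np.cos(theta),
                     py + v * dt * np.sin(theta),
                     v, theta])

def hx(x):
    return x[:2]  # we only observe the detected box center (px, py)

points = MerweScaledSigmaPoints(n=4, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=4, dim_z=2, dt=1.0, fx=fx, hx=hx, points=points)
ukf.x = np.array([0.0, 0.0, 1.0, 0.0])   # initial state [px, py, speed, heading]
ukf.P *= 10.0                            # initial uncertainty
ukf.R *= 2.0                             # detection noise
ukf.Q *= 0.1                             # process noise

for z in [np.array([1.1, 0.2]), np.array([2.0, 0.9]), np.array([2.7, 1.9])]:
    ukf.predict()
    ukf.update(z)
    print(ukf.x[:2])                     # smoothed center used for matching
```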

Authors:Emerson P. Grabke, Babak Taati, Masoom A. Haider
Title: Mitigating Multi-Sequence 3D Prostate MRI Data Scarcity through Domain Adaptation using Locally-Trained Latent Diffusion Models for Prostate Cancer Detection
Abstract:
Objective: Latent diffusion models (LDMs) could mitigate data scarcity challenges affecting machine learning development for medical image interpretation. The recent CCELLA LDM improved prostate cancer detection performance using synthetic MRI for classifier training but was limited to the axial T2-weighted (AxT2) sequence, did not investigate inter-institutional domain shift, and prioritized radiology over histopathology outcomes. We propose CCELLA++ to address these limitations and improve clinical utility. Methods: CCELLA++ expands CCELLA for simultaneous biparametric prostate MRI (bpMRI) generation, including the AxT2, high b-value diffusion series (HighB), and apparent diffusion coefficient map (ADC). Domain adaptation was investigated by pretraining classifiers on real or LDM-generated synthetic data from an internal institution, followed by fine-tuning on progressively smaller fractions of an out-of-distribution, external dataset. Results: CCELLA++ improved 3D FID for the HighB and ADC sequences but not AxT2 (0.013, 0.012, and 0.063, respectively) compared to CCELLA (0.060). Classifier pretraining with CCELLA++ bpMRI outperformed real bpMRI in AP and AUC for all domain adaptation scenarios. CCELLA++ pretraining achieved the highest classifier performance below 50% (n=665) external dataset volume. Conclusion: Synthetic bpMRI generated by our method can improve downstream classifier generalization and performance beyond real bpMRI or CCELLA-generated AxT2-only images. Future work should seek to quantify medical image sample quality, balance multi-sequence LDM training, and condition the LDM with additional information. Significance: The proposed CCELLA++ LDM can generate synthetic bpMRI that outperforms real data for domain adaptation with a limited target institution dataset. Our code is available at https://github.com/grabkeem/CCELLA-plus-plus
English: CCELLA++ enhances prostate cancer detection by generating synthetic biparametric MRI sequences that improve classifier performance and domain adaptation, outperforming real data when fine-tuning with limited external datasets.

Authors:Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai
Title: LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
Abstract:
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
English: The LIRA framework addresses the limitations of inaccurate segmentation and hallucinated comprehension in large multi-modal models by integrating semantic-enhanced feature extraction and interleaved local visual coupling, achieving state-of-the-art performance in both tasks.

Authors:Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini
Title: SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability
Abstract:
Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC.
English: SPARC introduces a unified framework that creates a shared sparse latent space across diverse AI models, enabling direct comparison of concept representations and significantly improving alignment through global sparsity and cross-reconstruction mechanisms.
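A small sketch of what a Global TopK constraint could look like in PyTorch: every model stream is forced onto the same active latent dimensions per sample. The evidence-pooling rule and dictionary layout are assumptions; the cross-reconstruction loss is only indicated in a comment.

```python
import torch

def global_topk(latents, k):
    """Illustrative Global TopK: all model streams must fire on the SAME
    latent dimensions. `latents` is a dict of (B, M) pre-activations, one
    entry per model (e.g. {'dino': ..., 'clip': ...})."""
    stacked = torch.stack(list(latents.values()))        # (S, B, M)
    score = stacked.abs().sum(dim=0)                     # (B, M) pooled evidence
    idx = score.topk(k, dim=-1).indices                  # shared support per sample
    mask = torch.zeros_like(score).scatter_(-1, idx, 1.0)
    return {name: z * mask for name, z in latents.items()}

B, M, k = 2, 4096, 32
latents = {'dino': torch.randn(B, M), 'clip': torch.randn(B, M)}
sparse = global_topk(latents, k)
# The cross-reconstruction loss (also assumed here) would decode model A's
# input from model B's sparse code to enforce semantic consistency.
```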

Authors:Jiangzhong Cao, Zekai Zeng, Xu Zhang, Huan Zhang, Chunling Fan, Gangyi Jiang, Weisi Lin
Title: Unveiling the Underwater World: CLIP Perception Model-Guided Underwater Image Enhancement
Abstract:
High-quality underwater images are essential for machine vision tasks as well as for viewers, given their aesthetic appeal. However, the quality of underwater images is severely affected by light absorption and scattering. Deep learning-based methods for Underwater Image Enhancement (UIE) have achieved good performance. However, these methods often overlook human perception and lack sufficient constraints within the solution space. Consequently, the enhanced images often suffer from diminished perceptual quality or poor content restoration. To address these issues, we propose a UIE method with a Contrastive Language-Image Pre-Training (CLIP) perception loss module and curriculum contrastive regularization. Above all, to develop a perception model for underwater images that aligns more closely with human visual perception, the visual semantic feature extraction capability of the CLIP model is leveraged to learn an appropriate prompt pair to map and evaluate the quality of underwater images. This CLIP perception model is then incorporated as a perception loss module into the enhancement network to improve the perceptual quality of enhanced images. Furthermore, the CLIP perception model is integrated with the curriculum contrastive regularization to enhance the constraints imposed on the enhanced images within the CLIP perceptual space, mitigating the risk of both under-enhancement and over-enhancement. Specifically, the CLIP perception model is employed to assess and categorize the learning difficulty level of negatives in the regularization process, ensuring comprehensive and nuanced utilization of distorted images and negatives with varied quality levels. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability.
English: This study introduces an underwater image enhancement method that integrates a CLIP-based perception loss module and curriculum contrastive regularization to better align with human visual perception and improve image quality by addressing under- and over-enhancement issues.
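To illustrate the prompt-pair scoring mechanism, here is a hedged sketch using the Hugging Face CLIP interface; the paper learns the prompt pair, whereas fixed text prompts stand in for it here, and the checkpoint choice is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for the learned prompt pair (the paper learns these embeddings).
prompts = ["a high quality underwater photo", "a low quality underwater photo"]

def clip_quality_score(image: Image.Image) -> torch.Tensor:
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image        # (1, 2)
    # Softmax over the pair: probability the image matches the 'good' prompt.
    return logits.softmax(dim=-1)[0, 0]

# As a perception loss during enhancement training one could use, e.g.,
# loss_perc = 1 - clip_quality_score(enhanced); the CLIP forward pass would
# then need to stay differentiable (no torch.no_grad).
```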

Authors:Inès Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Joon-Young Lee, Seungryong Kim
Title: Learning to Track Any Points from Human Motion
Abstract:
Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our proposed pipeline, AnthroTAP, addresses this with an automated procedure for generating pseudo-labeled training data, leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to the 256 GPUs used by recent state-of-the-art methods.
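A minimal numpy sketch of the projection step that turns fitted SMPL vertices into pseudo-trajectories; the SMPL fitting itself, the ray-casting occlusion test, and the optical-flow filter are only named here, and the camera intrinsics are made up for illustration.

```python
import numpy as np

def project_vertices(verts_3d, K):
    """Pinhole projection of SMPL mesh vertices into the image plane.
    verts_3d: (V, 3) camera-space vertices, K: (3, 3) intrinsics."""
    uvw = (K @ verts_3d.T).T                 # (V, 3) homogeneous pixel coords
    return uvw[:, :2] / uvw[:, 2:3]          # (V, 2) pixel coordinates

def pseudo_trajectories(verts_per_frame, K):
    """Stack per-frame projections into trajectories; occlusion handling
    (ray casting) and optical-flow filtering would be applied afterwards."""
    return np.stack([project_vertices(v, K) for v in verts_per_frame])  # (T, V, 2)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
frames = [np.random.rand(6890, 3) + [0, 0, 3] for _ in range(4)]  # 6890 SMPL verts
tracks = pseudo_trajectories(frames, K)
print(tracks.shape)   # (4, 6890, 2) candidate point tracks
```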

Authors:Keyan Chen, Chenyang Liu, Bowen Chen, Jiafan Zhang, Zhengxia Zou, Zhenwei Shi
Title: RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models
Abstract:
Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP's cross-modal alignment strength with SAM's segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP is employed as the dual-modal encoder to activate target features within its pre-aligned semantic space and generate localization prompts. To mitigate CLIP's misactivation challenges in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning via decomposition of text embeddings into complementary semantic subspaces. These optimized semantic prompts subsequently direct the SAM to generate pixel-level refined masks, thereby completing the semantic transmission pipeline. Extensive experiments (RefSegRS, RRSIS-D, and RISBench) demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (+~3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.
English Summary: RSRefSeg 2 introduces a decoupled dual-stage framework that combines CLIP's cross-modal alignment with SAM's segmentation capabilities to significantly improve remote sensing image segmentation accuracy and semantic interpretation.

Authors:Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers
Title: Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
Abstract:
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing of our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.
English: SceneDINO is a novel unsupervised method for semantic scene completion that leverages self-supervised learning and multi-view consistency to infer 3D geometry and semantics from single images without ground-truth annotations, achieving state-of-the-art segmentation accuracy.

Authors:Haoyu Wang, Lei Zhang, Wei Wei, Chen Ding, Yanning Zhang
Title: Prompt-Free Conditional Diffusion for Multi-object Image Augmentation
Abstract:
Diffusion models have underpinned many recent advances in dataset augmentation for various computer vision tasks. However, when generating multi-object images that reflect real scenarios, most existing methods either rely entirely on the text condition, resulting in a deviation between the generated objects and the original data, or rely too heavily on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of imposing pixel-by-pixel constraints, it bridges the quantity deviation between the generated and original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gains and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.
English: This paper introduces a prompt-free conditional diffusion framework that enhances multi-object image augmentation by replacing text prompts with a local-global semantic fusion strategy and incorporating a counting loss to improve diversity and alignment with original data, demonstrating superior performance in downstream tasks.
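A hedged sketch of a counting-style loss: per-category soft counts derived from a frozen reward/counting model are matched to the source image's counts instead of enforcing pixel-wise reconstruction. The tensor layout and the smooth-L1 choice are assumptions, not the paper's exact formulation.

```python
import torch

def counting_loss(pred_scores, target_counts):
    """Illustrative counting loss. `pred_scores` is a (B, C, N) tensor of
    per-candidate confidences from a frozen reward/counting model applied to
    the generated image; soft counts per category are compared to the source
    image's counts rather than enforcing pixel-level reconstruction."""
    soft_counts = pred_scores.sum(dim=-1)          # (B, C) expected object counts
    return torch.nn.functional.smooth_l1_loss(soft_counts, target_counts)

scores = torch.rand(2, 5, 20, requires_grad=True)  # 5 categories, 20 candidates
target = torch.tensor([[2., 0., 1., 3., 0.], [1., 1., 0., 0., 2.]])
loss = counting_loss(scores, target)
loss.backward()                                    # gradients flow to the generator
```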

Authors:Zhihao Chen, Tao Chen, Chenhui Wang, Qi Gao, Huidong Xie, Chuang Niu, Ge Wang, Hongming Shan
Title: LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models
Abstract:
Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: a Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantics while capturing global features with an efficient Mamba mechanism, and a Language-engaged Dual-space Alignment (LangDA) loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, the LangDA loss improves explainability by integrating language-guided insights into image reconstruction and can be used in a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available at https://github.com/hao1635/LangMamba.
English: LangMamba introduces a novel language-driven framework for LDCT denoising that leverages vision-language models to enhance semantic guidance, outperforming existing methods in detail preservation and generalizability while reducing training costs.
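A minimal sketch of a dual-space alignment loss in the spirit of LangDA, assuming two frozen encoders (a perceptual one and the LangAE semantic encoder); the stand-in encoders, L1 distances, and weights below are illustrative only.

```python
import torch
import torch.nn.functional as F

def langda_loss(denoised, ndct, perc_encoder, sem_encoder, w_perc=1.0, w_sem=1.0):
    """Illustrative dual-space alignment: `perc_encoder` stands in for a
    perceptual feature extractor and `sem_encoder` for the language-guided
    autoencoder's semantic encoder (both frozen). Weights are assumed."""
    loss_perc = F.l1_loss(perc_encoder(denoised), perc_encoder(ndct))
    loss_sem = F.l1_loss(sem_encoder(denoised), sem_encoder(ndct))
    return w_perc * loss_perc + w_sem * loss_sem

# toy stand-ins for the frozen encoders
perc = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU())
sem = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, stride=2), torch.nn.Flatten())
x, y = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
print(langda_loss(x, y, perc, sem))
```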

Authors:Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, Hao Li
Title: Omni-Video: Democratizing Unified Video Understanding and Generation
Abstract:
Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production, and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM to produce visual tokens and an adapter before the input of the diffusion decoder to map these visual tokens into the decoder's conditional space; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing, and understanding tasks.
English Summary: This report introduces Omni-Video, a unified framework that bridges the gap in video understanding and generation by enabling multimodal large language models to produce continuous visual clues for diffusion decoders, achieving high-quality video outputs with efficient architectural and training innovations.

Authors:Jiayi Song, Zihan Ye, Qingyuan Zhou, Weidong Yang, Ben Fei, Jingyi Xu, Ying He, Wanli Ouyang
Title: Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering
Abstract:
Accurately rendering scenes with reflective surfaces remains a significant challenge in novel view synthesis, as existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often misinterpret reflections as physical geometry, resulting in degraded reconstructions. Previous methods rely on incomplete and non-generalizable geometric constraints, leading to misalignment between the positions of Gaussian splats and the actual scene geometry. When dealing with real-world scenes containing complex geometry, the accumulation of Gaussians further exacerbates surface artifacts and results in blurred reconstructions. To address these limitations, in this work, we propose Ref-Unlock, a novel geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which explicitly disentangles transmitted and reflected components to better capture complex reflections and enhance geometric consistency in real-world scenes. Our approach employs a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, alongside a reflection removal module providing pseudo reflection-free supervision to guide clean decomposition. Additionally, we incorporate pseudo-depth maps and a geometry-aware bilateral smoothness constraint to enhance 3D geometric consistency and stability in decomposition. Extensive experiments demonstrate that Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves competitive results with NeRF-based models, while enabling flexible vision foundation model (VFM)-driven reflection editing. Our method thus offers an efficient and generalizable solution for realistic rendering of reflective scenes. Our code is available at https://ref-unlock.github.io/.
English Summary: Ref-Unlock introduces a geometry-aware reflection modeling framework using 3D Gaussian Splatting that explicitly separates transmitted and reflected components, significantly improving reconstruction quality and enabling reflection editing in complex reflective scenes.

Authors:Murilo Gustineli, Anthony Miyaguchi, Adrian Cheung, Divyansh Khattak
Title: Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification
Abstract:
We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network's 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
English: DS@GT's second-place solution for PlantCLEF 2025 combines a fine-tuned Vision Transformer with tiling, clustering, and geolocation filtering, achieving a 0.348 F1 score without extra training, with all code publicly available.
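A compact sketch of the tile-and-vote inference described above: a quadrat image is split into a 4x4 grid (2072 = 4 x 518 matches the stated receptive field), each tile is classified, predictions are reweighted by a cluster-specific prior, and votes are aggregated. The classifier, prior, and vote aggregation rule below are stand-ins, not the team's pipeline.

```python
import numpy as np

def tile_predict(image, classify, grid=4, priors=None):
    """Split a quadrat image into grid x grid tiles, classify each tile, then
    aggregate by vote; `classify` returns a (num_classes,) probability vector
    and `priors` is an optional per-cluster Bayesian prior over species."""
    H, W = image.shape[:2]
    th, tw = H // grid, W // grid
    votes = np.zeros_like(classify(image[:th, :tw]))
    for i in range(grid):
        for j in range(grid):
            tile = image[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            probs = classify(tile)
            if priors is not None:
                probs = probs * priors / (probs * priors).sum()  # reweight, renormalize
            votes[probs.argmax()] += 1
    return votes / votes.sum()   # species present = classes above a vote threshold

rng = np.random.default_rng(0)
img = rng.random((2072, 2072, 3))                       # 4 x 518 on each side
fake_classifier = lambda t: rng.dirichlet(np.ones(10))  # stand-in for the ViT
print(tile_predict(img, fake_classifier, priors=np.full(10, 0.1)))
```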

Authors:Tongtong Cheng, Rongzhen Li, Yixin Xiong, Tao Zhang, Jing Wang, Kai Liu
Title: MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
Abstract:
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relationships, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM.
English: The proposed Multimodal Causal Analysis Model (MCAM) overcomes limitations in existing methods by constructing latent causal structures between visual and language modalities, achieving state-of-the-art performance in autonomous driving video understanding through multi-level feature extraction and dynamic scenario modeling.

Authors:Chang Liu, Ye Pan, Chenyang Ding, Susanto Rahardja, Xiaokang Yang
Title: MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
Abstract:
Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs, including text descriptions and reference expression images, to guide the generation of user-specified facial expressions. With MetaHuman as the primary target, our generated results can be conveniently integrated into the industrial production pipeline. The code is available at: https://github.com/SJTU-Lucy/MEDTalk.
English: MEDTalk introduces a novel framework for dynamic emotional 3D facial animation by disentangling content and emotion from motion sequences, integrating audio and text inputs to generate fine-grained expressions while enabling multimodal control for personalized results compatible with MetaHuman pipelines.

Authors:Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev
Title: T-LoRA: Single Image Diffusion Model Customization Without Overfitting
Abstract:
While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
English: This paper introduces T-LoRA, a timestep-dependent low-rank adaptation framework that addresses overfitting in single-image diffusion model customization by implementing dynamic rank adjustment and orthogonal weight initialization, achieving superior balance between concept fidelity and text alignment.
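A minimal sketch of timestep-dependent rank masking, assuming a linear schedule in which high (noisier) timesteps use fewer rank components because they overfit more readily; the schedule, shapes, and initialization details are assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class TimestepLoRA(nn.Module):
    """Illustrative timestep-dependent LoRA: the usable rank shrinks at high
    diffusion timesteps, where overfitting is strongest. Orthogonal init keeps
    rank components independent; the linear schedule is an assumption."""
    def __init__(self, in_f, out_f, rank=16, max_t=1000):
        super().__init__()
        self.down = nn.Parameter(torch.empty(rank, in_f))
        self.up = nn.Parameter(torch.zeros(out_f, rank))   # zero init, as in LoRA
        nn.init.orthogonal_(self.down)
        self.rank, self.max_t = rank, max_t

    def forward(self, x, t):
        # Active rank decreases linearly with timestep (assumed schedule).
        active = max(1, int(self.rank * (1 - t / self.max_t)))
        mask = torch.zeros(self.rank, device=x.device)
        mask[:active] = 1.0
        return x @ (self.down * mask[:, None]).t() @ self.up.t()

lora = TimestepLoRA(64, 64)
h = torch.randn(2, 64)
print(lora(h, t=100).shape, lora(h, t=900).shape)  # more vs. fewer components
```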

Authors:Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, Weizhi Wang
Title: Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation
Abstract:
Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://ali-videoai.github.io/Tora2_page/.

Authors:Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu
Title: High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
Abstract:
State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can develop robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold-start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short-answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities than GRPO, leading to a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.
English: The paper introduces MGPO, a reinforcement learning framework that enables large multi-modal models to iteratively focus on key image regions through automatic cropping, improving performance on visual tasks without requiring costly grounding annotations.
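A hedged sketch of one MGPO-style rollout: turn one predicts grounding coordinates, the crop is fed back in turn two, and the reward is binary on final-answer correctness, as the abstract describes. `model.chat`, the coordinate format, and the crop resolution are hypothetical interfaces, not the released API.

```python
from PIL import Image

def mgpo_rollout(model, image: Image.Image, question: str, answer: str) -> float:
    """Illustrative two-turn rollout with a binary reward for RL training."""
    # Turn 1: ask for the region relevant to the question (hypothetical API).
    x0, y0, x1, y1 = model.chat(image, f"Where should I look to answer: {question}? "
                                       "Reply with a bounding box.")
    crop = image.crop((x0, y0, x1, y1)).resize((1024, 1024))  # zoom into detail
    # Turn 2: answer using the high-resolution crop.
    pred = model.chat(crop, question)
    # Binary reward derived only from final-answer correctness.
    return 1.0 if pred.strip() == answer.strip() else 0.0
```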

Authors:Yuedong Tan, Jiawei Shao, Eduard Zamfir, Ruanjun Li, Zhaochong An, Chao Ma, Danda Paudel, Luc Van Gool, Radu Timofte, Zongwei Wu
Title: What You Have is What You Track: Adaptive and Robust Multimodal Tracking
Abstract:
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.
English: This paper introduces a flexible multimodal tracking framework that dynamically adapts to temporally incomplete data through a novel fusion mechanism and masking strategy, achieving state-of-the-art performance across multiple benchmarks.

Authors:Ruijie Lu, Yu Liu, Jiaxiang Tang, Junfeng Ni, Yuxiang Wang, Diwen Wan, Gang Zeng, Yixin Chen, Siyuan Huang
Title: DreamArt: Generating Interactable Articulated Objects from a Single Image
Abstract:
Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline: first, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompts and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at https://dream-art-0.github.io/DreamArt/.
English: DreamArt is a novel framework that generates high-fidelity, articulated 3D objects from single-view images through a three-stage process involving part-segmented mesh reconstruction, articulation modeling via video diffusion, and motion optimization with texture refinement.

Authors:Yisu Zhang, Chenjie Cao, Chaohui Yu, Jianke Zhu
Title: LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion
Abstract:
Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movements to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects, with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data. Project Page: https://fuchengsu.github.io/lionlora.github.io/

Authors:Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang
Title: MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
Abstract:
Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen
English Summary: Medical video generation has been limited by the absence of specialized datasets, leading to the creation of MedVideoCap-55K, a comprehensive dataset that enables MedGen to achieve top-tier performance in both visual quality and medical precision.

Authors:Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, Liqiang Nie
Title: OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users' intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (OFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/
English: The paper introduces OFFSET, a novel network that addresses key limitations in Composed Image Retrieval by using focus mapping to identify dominant image portions and text-guided revision to enhance modification focus, demonstrating superior performance across multiple datasets.

Authors:Shuai Li, Shihan Chen, Wanru Geng, Zhaohua Xu, Xiaolu Liu, Can Dong, Zhen Tian, Changlin Chen
Title: Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering
Abstract:
In the realm of industrial quality inspection, defect detection stands as a critical component, particularly in high-precision, safety-critical sectors such as automotive components, aerospace, and medical devices. Traditional methods, reliant on manual inspection or early image processing algorithms, suffer from inefficiencies, high costs, and limited robustness. This paper introduces a semi-supervised defect detection framework based on conditional diffusion (DSYM), leveraging a two-stage collaborative training mechanism and a staged joint optimization strategy. The framework utilizes labeled data for initial training and subsequently incorporates unlabeled data through the generation of pseudo-labels. A conditional diffusion model synthesizes multi-scale pseudo-defect samples, while a CLIP cross-modal feature-based noise filtering mechanism mitigates label contamination. Experimental results on the NEU-DET dataset demonstrate a 78.4% mAP@0.5 with the same amount of labeled data as traditional supervised methods, and 75.1% mAP@0.5 with only 40% of the labeled data required by the original supervised model, showcasing significant advantages in data efficiency. This research provides a high-precision, low-labeling-dependent solution for defect detection in industrial quality inspection scenarios. The work of this article has been open-sourced at https://github.com/cLin-c/Semisupervised-DSYM.
English: This paper presents a semi-supervised defect detection framework called DSYM that uses conditional diffusion models and a two-stage training approach to achieve high precision with reduced labeled data, demonstrating superior data efficiency on the NEU-DET dataset.
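A minimal sketch of the CLIP-based noise-filtering idea for pseudo-labels, using the Hugging Face CLIP interface: a pseudo-labeled crop is kept only if its image embedding is close enough to a text prompt for its class. The prompt template, threshold, and checkpoint are assumptions, not the paper's settings.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pseudo_label(crop: Image.Image, class_name: str, thresh: float = 0.25) -> bool:
    """Illustrative cross-modal noise filter for pseudo-labeled defect crops."""
    inputs = processor(text=[f"a photo of a {class_name} surface defect"],
                       images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # Keep the pseudo-label only if the crop plausibly shows the named defect.
    return float(img @ txt.T) >= thresh
```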

Authors:Pedro R. A. S. Bassi, Wenxuan Li, Jieneng Chen, Zheren Zhu, Tianyu Lin, Sergio Decherchi, Andrea Cavalli, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou
Title: Learning Segmentation from Radiology Reports
Abstract:
Tumor segmentation in CT scans is key for diagnosis, surgery, and prognosis, yet segmentation masks are scarce because their creation requires time and expertise. Public abdominal CT datasets have from dozens to a couple thousand tumor masks, but hospitals have hundreds of thousands of tumor CTs with radiology reports. Thus, leveraging reports to improve segmentation is key for scaling. In this paper, we propose a report-supervision loss (R-Super) that converts radiology reports into voxel-wise supervision for tumor segmentation AI. We created a dataset with 6,718 CT-report pairs (from the UCSF Hospital) and merged it with public CT-mask datasets (from AbdomenAtlas 2.0). We used R-Super to train with these masks and reports, and strongly improved tumor segmentation in internal and external validation, with the F1 score increasing by up to 16% relative to training with masks only. By leveraging readily available radiology reports to supplement scarce segmentation masks, R-Super strongly improves AI performance both when very few training masks are available (e.g., 50) and when many are available (e.g., 1.7K). Project: https://github.com/MrGiovanni/R-Super
English: This paper introduces R-Super, a novel report-supervision loss that converts radiology reports into voxel-wise supervision to enhance tumor segmentation in CT scans, significantly improving AI performance by up to 16% in F1 scores even with limited segmentation masks.

Authors:Andrew Randono
Title: Cloud Diffusion Part 1: Theory and Motivation
Abstract:
Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate the signal from the noise. The noise profile used by these models is white noise -- that is, noise based on independent normal distributions at each point whose mean and variance are independent of scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large-scale correlations and de-emphasizes small-scale correlations. These scale-invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a "Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
English: Cloud Diffusion Models replace the white noise in traditional diffusion models with scale-invariant noise profiles that better match natural images' statistical properties, promising faster inference, enhanced high-frequency details, and improved controllability.
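A short numpy sketch of the core idea: shaping white noise in Fourier space to a 1/f^alpha power spectrum yields scale-invariant, "cloud-like" noise that could stand in for white noise in a diffusion model. The normalization and the alpha value are choices for illustration, not prescriptions from the paper.

```python
import numpy as np

def power_law_noise(shape, alpha=1.0, rng=None):
    """Generate a noise image whose amplitude spectrum falls off as 1/f^alpha
    (alpha=0 recovers white noise; alpha around 1 gives long-range spatial
    correlations typical of natural image statistics)."""
    rng = rng or np.random.default_rng()
    white = rng.standard_normal(shape)
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0                        # avoid division by zero at DC
    spectrum = np.fft.fft2(white) / f**alpha
    noise = np.real(np.fft.ifft2(spectrum))
    return (noise - noise.mean()) / noise.std()   # zero mean, unit variance

cloud = power_law_noise((256, 256), alpha=1.0)    # drop-in for white noise
```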

Authors:Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani
Title: pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
Abstract:
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at https://github.com/sajjad-ucsb/pFedMMA.
English: The pFedMMA framework introduces multi-modal adapters with a shared global projection to enhance both personalization and generalization in federated learning, achieving superior performance across diverse datasets while maintaining communication efficiency.
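A hedged sketch of the communication pattern described above: only the globally shared projection is exchanged and averaged, while modality-specific adapter layers stay local to each client. The client interface names (`shared`, `train_epoch`) are hypothetical, not the released API.

```python
import copy
import torch

def federated_round(clients, server_shared):
    """Illustrative pFedMMA-style round: each client keeps its modality-
    specific adapter weights private; only the shared cross-modal projection
    is broadcast, locally updated, and averaged back on the server."""
    states = []
    for client in clients:
        client.shared.load_state_dict(server_shared.state_dict())  # broadcast
        client.train_epoch()          # local update of private + shared parts
        states.append(copy.deepcopy(client.shared.state_dict()))
    # FedAvg over the shared projection only; this keeps rounds cheap since
    # nothing else travels between clients and server.
    avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
    server_shared.load_state_dict(avg)
    return server_shared
```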

Authors:Xiang Xu, Lingdong Kong, Song Wang, Chuanwei Zhou, Qingshan Liu
Title: Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
Abstract:
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.
中文: LiMA是一种新颖的长时记忆聚合框架,通过跨视图聚合、长时特征传播和跨序列对齐机制捕捉长程时序关联,显著提升了激光雷达语义分割与3D目标检测性能,且在下游任务中不产生额外计算开销。
English: LiMA is a novel long-term memory aggregation framework that enhances LiDAR representation learning by capturing extended temporal correlations through cross-view aggregation, long-term feature propagation, and cross-sequence alignment, significantly improving performance in semantic segmentation and 3D object detection without added computational costs during downstream tasks.

Authors:Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
Title: Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
Abstract:
Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.

Authors:Haozhen Zheng, Beitong Tian, Mingyuan Wu, Zhenggang Tang, Klara Nahrstedt, Alex Schwing
Title: Spatio-Temporal LLM: Reasoning about Environments and Actions
Abstract:
Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.

Authors:Jiahao Zhu, Zixuan Chen, Guangcong Wang, Xiaohua Xie, Yi Zhou
Title: SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
Abstract:
Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present SegmentDreamer, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, SCTD partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on distillation error. Additionally, we propose a distillation pipeline for a more swift and stable generation. Extensive experiments demonstrate that our SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).

Authors:Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann, Marco Caccamo, Christoph Stiller
Title: From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving
Abstract:
Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at https://frommarginaltojointpred.github.io/.

Authors:Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
Title: StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Abstract:
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: \href{https://streamvln.github.io/}{https://streamvln.github.io/}.
中文: StreamVLN通过慢-快双流上下文建模框架,结合动态令牌剪枝与KV缓存复用技术,在保证细粒度视觉理解的同时实现高效低延迟的视觉语言导航。
English: StreamVLN introduces a hybrid slow-fast context modeling framework that enables efficient, low-latency navigation by balancing fine-grained visual understanding with computational efficiency through dynamic token pruning and KV cache reuse.
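A toy sketch of the slow-fast idea under simplifying assumptions: a fixed sliding window holds recent turns verbatim (fast context), and turns evicted from the window are compressed into a bounded memory (slow context). Plain token subsampling stands in for the paper's 3D-aware token pruning, and the class and method names are hypothetical.

```python
from collections import deque

class SlowFastContext:
    """Toy slow-fast context: a sliding window keeps the last `window`
    dialogue turns verbatim, while evicted turns are compressed into a
    bounded memory by token subsampling (a stand-in for 3D-aware pruning)."""

    def __init__(self, window: int = 4, keep_every: int = 4):
        self.fast = deque(maxlen=window)
        self.slow = []
        self.keep_every = keep_every

    def add_turn(self, tokens: list):
        if len(self.fast) == self.fast.maxlen:
            oldest = self.fast[0]
            self.slow.extend(oldest[:: self.keep_every])  # compress, then evict
        self.fast.append(tokens)

    def context(self) -> list:
        return self.slow + [t for turn in self.fast for t in turn]

ctx = SlowFastContext()
for i in range(6):
    ctx.add_turn([f"t{i}_{j}" for j in range(8)])
print(len(ctx.context()))  # bounded: 2 compressed turns + 4 full turns = 36 tokens
```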

Authors:Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer
Title: All in One: Visual-Description-Guided Unified Point Cloud Segmentation
Abstract:
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.
中文: VDG-Uni3DSeg提出了一种创新框架,通过融合视觉语言和大语言模型,利用多模态线索和专门设计的模块来增强三维点云分割,在语义、实例和全景分割任务中均取得了领先性能。
English: VDG-Uni3DSeg introduces a novel framework that integrates vision-language and large language models to enhance 3D point cloud segmentation by leveraging multimodal cues and specialized modules, achieving state-of-the-art performance in semantic, instance, and panoptic segmentation.

Authors:Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao, Xiaobin Hu, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Title: Semantic Frame Interpolation
Abstract:
Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models, represented by Wan, to endow them with first-to-last-frame generation capabilities. However, these models can only generate a fixed number of frames and often fail to produce satisfactory results for certain frame lengths, and this setting lacks a clear official definition and a well-established benchmark. In this paper, we first formally define a new, practical Semantic Frame Interpolation (SFI) task, which covers the above two settings and supports inference at multiple frame rates. To achieve this goal, we propose a novel SemFi model built upon Wan2.1, which incorporates a Mixture-of-LoRA module to ensure the generation of high-consistency content that aligns with control conditions across various frame length limitations. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To support this, we collect and process data from the perspective of SFI, carefully designing evaluation metrics and methods to assess the model's performance across multiple dimensions, encompassing image and video, and various aspects, including consistency and diversity. Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well-suited to meet the requirements of the SFI task.
中文摘要:本文提出了语义帧插值(SFI)新任务,开发了基于Wan2.1的SemFi模型,通过混合LoRA模块实现在不同帧长限制下生成符合控制条件的高一致性内容,并创建了首个专用数据集SFI-300K进行多维评估验证。
English Summary: This paper introduces the Semantic Frame Interpolation (SFI) task and proposes the SemFi model with a Mixture-of-LoRA module to generate variable-length intermediate videos between given frames while maintaining content consistency and control alignment, supported by the newly created SFI-300K benchmark dataset.
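A hedged sketch of a Mixture-of-LoRA layer, assuming several low-rank adapters over a frozen base weight mixed by a gate conditioned on the target frame count; the gating signal, ranks, and shapes are illustrative guesses rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MixtureOfLoRA(nn.Module):
    """Toy Mixture-of-LoRA: several low-rank adapters share one frozen
    base weight; a gate (here conditioned on the normalized target frame
    count) mixes their updates, letting one model serve many frame-length
    settings."""

    def __init__(self, dim: int, rank: int = 4, n_experts: int = 3):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(n_experts, rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, dim, rank))
        self.gate = nn.Linear(1, n_experts)

    def forward(self, x: torch.Tensor, n_frames: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(n_frames), dim=-1)        # (B, E) expert mix
        delta = torch.einsum("erd,eor->eod", self.A, self.B)  # per-expert dW
        mixed = torch.einsum("be,eod->bod", w, delta)         # per-sample dW
        return self.base(x) + torch.einsum("bod,bd->bo", mixed, x)

m = MixtureOfLoRA(dim=32)
x = torch.randn(4, 32)
n = torch.tensor([[17.0], [33.0], [65.0], [49.0]]) / 100.0  # normalized frame counts
print(m(x, n).shape)  # torch.Size([4, 32])
```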

Authors:Nusrat Munia, Junfeng Zhu, Olfa Nasraoui, Abdullah-Al-Zubaer Imran
Title: Differential Attention for Multimodal Crisis Event Analysis
Abstract:
Social networks can be a valuable source of information during crisis events. In particular, users can post a stream of multimodal data that can be critical for real-time humanitarian response. However, effectively extracting meaningful information from this large and noisy data stream and integrating heterogeneous data remains a formidable challenge. In this work, we explore vision language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data in three different tasks. We incorporate LLaVA-generated text to improve text-image alignment. Additionally, we leverage Contrastive Language-Image Pretraining (CLIP)-based vision and text embeddings, which, without task-specific fine-tuning, outperform traditional models. To further refine multimodal fusion, we employ Guided Cross Attention (Guided CA) and combine it with the Differential Attention mechanism to enhance feature alignment by emphasizing critical information while filtering out irrelevant content. Our results show that while Differential Attention improves classification performance, Guided CA remains highly effective in aligning multimodal features. Extensive experiments on the CrisisMMD benchmark dataset demonstrate that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, contributing to more reliable and interpretable models for three different tasks that are crucial for disaster response. Our code is available at https://github.com/Munia03/Multimodal_Crisis_Event.
中文: 本研究通过融合视觉语言模型与先进的多模态融合策略,显著提升了危机数据的分类性能,在灾害应对任务中实现了更优的准确性和可解释性。
English: This study enhances crisis data classification by integrating vision-language models with advanced fusion strategies, achieving superior accuracy and interpretability for disaster response tasks.
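For reference, a minimal sketch of a differential attention head in the spirit of the Differential Transformer formulation this line of work builds on: two softmax attention maps are subtracted so common-mode (irrelevant) attention cancels. The head splitting and lambda reparameterization of the full mechanism are omitted.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    """Differential attention: the difference of two softmax attention
    maps cancels attention mass shared by both maps, sharpening focus
    on the tokens that actually matter."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

B, N, D = 2, 5, 16
out = differential_attention(
    *(torch.randn(B, N, D) for _ in range(4)), torch.randn(B, N, D)
)
print(out.shape)  # torch.Size([2, 5, 16])
```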

Authors:Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
Title: 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
Abstract:
Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low-FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system that uses only low-FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
中文: 本研究提出了一种利用低帧率相机的高速4D捕捉系统,通过异步采集方案提升等效帧率,并采用生成模型修复重建伪影。
English: This study introduces a high-speed 4D capture system using low-frame-rate cameras, employing an asynchronous capture scheme to boost effective frame rates and a generative model to correct reconstruction artifacts.
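The arithmetic behind asynchronous capture is worth making explicit: staggering n groups of cameras by 1/(n * base_fps) interleaves their frames into an n-times higher effective rate. A small sketch, with the grouping chosen for illustration:

```python
def staggered_timestamps(n_groups: int, base_fps: float, n_frames: int):
    """Timestamps captured by n_groups of cameras at base_fps whose start
    times are offset by 1/(n_groups * base_fps); interleaved, they yield
    an effective rate of n_groups * base_fps."""
    period = 1.0 / base_fps
    offset = period / n_groups
    return sorted(
        g * offset + k * period for g in range(n_groups) for k in range(n_frames)
    )

ts = staggered_timestamps(n_groups=4, base_fps=25.0, n_frames=3)
print([round(t * 1000, 1) for t in ts])  # ms; 10 ms spacing => 100 FPS effective
```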

Authors:Nicholas Chivaran, Jianbing Ni
Title: LAID: Lightweight AI-Generated Image Detection in Spatial and Spectral Domains
Abstract:
The recent proliferation of photorealistic AI-generated images (AIGI) has raised urgent concerns about their potential misuse, particularly on social media platforms. Current state-of-the-art AIGI detection methods typically rely on large, deep neural architectures, creating significant computational barriers to real-time, large-scale deployment on platforms like social media. To challenge this reliance on computationally intensive models, we introduce LAID, the first framework -- to our knowledge -- that benchmarks and evaluates the detection performance and efficiency of off-the-shelf lightweight neural networks. In this framework, we comprehensively train and evaluate selected models on a representative subset of the GenImage dataset across spatial, spectral, and fusion image domains. Our results demonstrate that lightweight models can achieve competitive accuracy, even under adversarial conditions, while incurring substantially lower memory and computation costs compared to current state-of-the-art methods. This study offers valuable insight into the trade-off between efficiency and performance in AIGI detection and lays a foundation for the development of practical, scalable, and trustworthy detection systems. The source code of LAID can be found at: https://github.com/nchivar/LAID.
中文摘要:LAID框架研究表明,轻量级神经网络能以显著降低的计算成本实现与现有方法相媲美的AI生成图像检测精度,为社交媒体平台的可扩展部署提供了实用解决方案。
English summary: The LAID framework demonstrates that lightweight neural networks can achieve competitive accuracy in detecting AI-generated images with significantly lower computational costs, offering a practical solution for scalable deployment on social media platforms.
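As a concrete example of a spectral-domain input, the sketch below computes a normalized log-magnitude Fourier spectrum, one common preprocessing for AIGI detectors since generator artifacts often appear as periodic spectral peaks; whether LAID uses exactly this transform is an assumption.

```python
import numpy as np

def spectral_view(image: np.ndarray) -> np.ndarray:
    """Log-magnitude Fourier spectrum of a grayscale image, shifted so
    low frequencies sit in the center -- one common 'spectral domain'
    input for AI-generated-image detectors."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    mag = np.log1p(np.abs(spectrum))
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)  # normalize to [0, 1]

img = np.random.rand(32, 32)  # stand-in for a 32x32 GenImage crop
print(spectral_view(img).shape)  # (32, 32)
```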

Authors:Yingyu Yang, Qianye Yang, Kangning Cui, Can Peng, Elena D'Alberti, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris T. Papageorghiou, J. Alison Noble
Title: Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos
Abstract:
The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which are time-consuming and labor-intensive to obtain. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves a mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with an MAE of 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views owing to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at https://github.com/YingyuYyy/CardiacPhase.
中文: 本文提出了一种无监督框架,通过自监督学习从超声心动图视频中检测心脏相位而无需人工标注,在成人和胎儿应用中均实现了与有监督方法相当的性能。
English: This paper introduces an unsupervised framework using self-supervised learning to detect cardiac phases from echocardiography videos without manual annotations, achieving performance comparable to supervised methods in both adult and fetal applications.
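If the learned latent trajectory tracks a volume-like quantity over the cardiac cycle, ED and ES candidates fall at its local maxima and minima. A toy illustration of that final detection step (the trajectory here is synthetic, and simple peak picking is a stand-in for the paper's procedure):

```python
import numpy as np
from scipy.signal import find_peaks

def detect_phases(trajectory: np.ndarray, min_gap: int = 10):
    """Return candidate ED (peaks) and ES (troughs) frame indices from a
    1D latent motion profile, assuming it rises toward end-diastole."""
    ed, _ = find_peaks(trajectory, distance=min_gap)
    es, _ = find_peaks(-trajectory, distance=min_gap)
    return ed, es

t = np.linspace(0, 4 * np.pi, 200)
profile = np.sin(t) + 0.05 * np.random.randn(200)  # synthetic cardiac cycles
ed, es = detect_phases(profile)
print("ED frames:", ed, "ES frames:", es)
```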

Authors:Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar, Archit Joshi
Title: VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems
Abstract:
The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the distinction between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at https://github.com/V-i-g-n-e-s-h-N/VERITAS.
中文: AI生成内容虽革新了数字媒体,却引发真实性担忧,为此提出VERITAS框架,通过定位伪影和生成可读解释,实现对合成图像的检测与归因分析。
English: AI-generated content has transformed digital media but challenges authenticity, prompting the development of VERITAS, a framework that detects and explains synthetic images through artifact localization and human-readable reasoning.

Authors:Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Title: VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
Abstract:
Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance compared with state-of-the-art VLA models, achieving significantly higher success rates and 39$\times$ faster inference than OpenVLA with 46 Hz throughput on edge platforms, demonstrating practical deployability. The code is available at https://github.com/LukeLIN-web/VOTE.
中文摘要:本研究提出的训练框架和推理优化技术显著减少了视觉语言动作模型的行动令牌与延迟,同时通过改进行动利用提升了性能,实现了比现有最优方法更快的推理速度和更高的成功率。
English summary: This study introduces a training framework and inference optimization technique that significantly reduces action tokens and latency while enhancing performance in Vision Language Action models, achieving faster inference and higher success rates than state-of-the-art methods.
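A toy sketch of trajectory ensemble voting, assuming overlapping action-chunk predictions from successive inference steps yield several candidates for the same timestep; the proximity-based majority rule below is an illustrative stand-in for the paper's voting strategy.

```python
import numpy as np

def vote_actions(candidates: np.ndarray, tol: float = 0.05) -> np.ndarray:
    """Ensemble-vote over candidate action vectors for one timestep.

    candidates: (K, D) array of K predicted actions. Groups candidates
    within `tol` (L2) of each anchor and returns the mean of the largest
    group -- a simple majority-style vote that rejects outlier predictions.
    """
    best_group = max(
        ([c for c in candidates if np.linalg.norm(c - anchor) <= tol]
         for anchor in candidates),
        key=len,
    )
    return np.mean(best_group, axis=0)

preds = np.array([[0.10, 0.0], [0.11, 0.01], [0.50, 0.4], [0.09, -0.01]])
print(vote_actions(preds))  # averages the three nearby predictions: [0.1, 0.0]
```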

Authors:Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
Title: Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
Abstract:
Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
中文: 本文提出新型自动化历史文档修复系统(AutoHDR)和完整数据集(FPHDR),通过三阶段工作流程和人机协作显著提升OCR识别准确率,在文化遗产保护领域实现重要突破。
English: This paper introduces a novel automated historical document restoration system (AutoHDR) and a comprehensive dataset (FPHDR), which significantly enhances OCR accuracy through a three-stage workflow and human-machine collaboration, representing a major advancement in cultural heritage preservation.

Authors:Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Title: ICAS: Detecting Training Data from Autoregressive Image Generative Models
Abstract:
Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, extensive experiments reveal two novel key findings: (1) a linear scaling law on membership inference, exposing the vulnerability of large foundation models; and (2) training data from scale-wise visual autoregressive models is easier to detect than that from other autoregressive paradigms. Our code is available at https://github.com/Chrisqcwx/ImageAR-MIA.
中文: 本研究首次针对自回归图像生成模型提出成员推理方法,通过隐式分类与自适应分数聚合策略有效检测训练数据滥用,并揭示大型基础模型的脆弱性及不同范式间的检测差异。
English: This study introduces the first membership inference method for autoregressive image generative models, combining implicit classification with adaptive score aggregation to effectively detect unauthorized training data usage and revealing vulnerabilities in large foundation models.
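A hedged sketch of the adaptive aggregation idea: given per-token implicit classification scores, weight the lowest-scoring tokens more heavily (here simply by averaging the bottom fraction); the exact weighting in the paper may differ.

```python
import numpy as np

def aggregate_scores(token_scores: np.ndarray, k_frac: float = 0.3) -> float:
    """Aggregate per-token membership scores with emphasis on the
    lowest-scoring tokens, which carry the most membership signal here;
    a higher final score suggests the image was in the training set."""
    k = max(1, int(len(token_scores) * k_frac))
    lowest = np.sort(token_scores)[:k]
    return float(lowest.mean())

scores = np.random.rand(256)  # implicit classification score per image token
print(aggregate_scores(scores))
```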

Authors:Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr
Title: CytoDiff: AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics
Abstract:
Biomedical datasets are often constrained by stringent privacy requirements and frequently suffer from severe class imbalance. These two aspects hinder the development of accurate machine learning models. While generative AI offers a promising solution, producing synthetic images of sufficient quality for training robust classifiers remains challenging. This work addresses the classification of individual white blood cells, a critical task in diagnosing hematological malignancies such as acute myeloid leukemia (AML). We introduce CytoDiff, a stable diffusion model fine-tuned with LoRA weights and guided by few-shot samples that generates high-fidelity synthetic white blood cell images. Our approach demonstrates substantial improvements in classifier performance when training data is limited. Using a small, highly imbalanced real dataset, the addition of 5,000 synthetic images per class improved ResNet classifier accuracy from 27\% to 78\% (+51\%). Similarly, CLIP-based classification accuracy increased from 62\% to 77\% (+15\%). These results establish synthetic image generation as a valuable tool for biomedical machine learning, enhancing data coverage and facilitating secure data sharing while preserving patient privacy. The code is publicly available at https://github.com/JanCarreras24/CytoDiff.
中文摘要:生物医学数据集常受隐私限制和类别不平衡的困扰,影响机器学习准确性,而CytoDiff这一稳定扩散模型通过生成高质量合成白细胞图像,显著提升了分类器性能,将ResNet准确率提高51%、CLIP提高15%。
English Summary: Biomedical datasets face privacy constraints and class imbalance, hindering accurate machine learning, but CytoDiff, a stable diffusion model, generates high-fidelity synthetic white blood cell images that significantly boost classifier performance, improving ResNet accuracy by 51% and CLIP by 15%.
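Operationally, sampling from such a model could look like the following diffusers-style sketch; the base checkpoint, LoRA path, and prompt are placeholders, not artifacts released by the authors.

```python
import torch
from diffusers import StableDiffusionPipeline

# Requires a GPU; the base weights download on first run.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/cytodiff-lora")  # hypothetical fine-tuned weights

# One synthetic cell per call; class balance comes from prompting per class.
image = pipe("single white blood cell, myeloblast, blood smear").images[0]
image.save("synthetic_wbc.png")
```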

Authors:Ricardo Cardoso, Plinio Moreno
Title: Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning
Abstract:
Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.
Chinese: 本文提出了一种结合深度图像稀疏点云与RGB图像的新方法,通过合成数据克服训练数据不足,实现了物体质量的精确估计,并在各项评估指标上显著优于现有基准。
English: This paper introduces a novel method that combines sparse point-cloud data from depth images with RGB images to accurately estimate object mass, significantly outperforming existing benchmarks by using synthetic data to overcome training data limitations.

Authors:Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy
Title: Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision
Abstract:
Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach enables a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
中文:MML-SurgAdapt提出了一种基于视觉语言模型的统一多任务框架,通过自然语言监督处理多种外科手术任务,采用单正例多标签学习方法有效应对部分标注问题,在减少23%标注需求的同时保持了与专用模型相当的性能。
English: MML-SurgAdapt introduces a unified multi-task framework using Vision-Language Models to handle diverse surgical tasks with natural language supervision, effectively addressing partial annotations through Single Positive Multi-Label learning and reducing labeling requirements by 23% while maintaining performance comparable to task-specific models.
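For intuition, the classic "assume negative" SPML baseline treats every unobserved label as negative, which is what lets a single observed positive per sample still supervise a multi-label head; a minimal version of that baseline (not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def spml_assume_negative_loss(logits: torch.Tensor, pos_index: torch.Tensor) -> torch.Tensor:
    """Single-positive multi-label loss: each sample carries one observed
    positive label; all other labels are treated as (possibly noisy)
    negatives -- the standard 'assume negative' SPML baseline."""
    targets = torch.zeros_like(logits)
    targets[torch.arange(logits.size(0)), pos_index] = 1.0
    return F.binary_cross_entropy_with_logits(logits, targets)

logits = torch.randn(8, 10)          # 8 samples, 10 surgical labels
pos = torch.randint(0, 10, (8,))     # single observed positive per sample
print(spml_assume_negative_loss(logits, pos).item())
```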

Authors:Britty Baby, Vinkle Srivastav, Pooja P. Jain, Kun Yuan, Pietro Mascagni, Nicolas Padoy
Title: Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition
Abstract:
The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help analyse the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
中文: 本研究提出CVS-AdaptNet多模态框架,通过图像嵌入与文本提示的对齐,显著提升了腹腔镜胆囊切除术中安全临界视图的识别能力,相比纯图像方法将mAP提高了6个百分点。
English: This study introduces CVS-AdaptNet, a multi-modal framework that enhances Critical View of Safety recognition in laparoscopic cholecystectomy by aligning image embeddings with textual prompts, achieving a 6-point mAP improvement over image-only methods.
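A minimal sketch of multi-label scoring with positive and negative prompts, assuming CLIP-style embeddings and one prompt pair per CVS criterion: the binary logit is the similarity to the positive prompt minus the similarity to the negative one. Shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def cvs_logits(img_emb, pos_text_emb, neg_text_emb, temperature: float = 0.07):
    """Per-criterion binary logits from CLIP-style embeddings: similarity
    to the 'criterion achieved' prompt minus similarity to the
    'criterion not achieved' prompt."""
    img = F.normalize(img_emb, dim=-1)       # (B, D)
    pos = F.normalize(pos_text_emb, dim=-1)  # (C, D): one prompt per criterion
    neg = F.normalize(neg_text_emb, dim=-1)  # (C, D)
    return (img @ pos.T - img @ neg.T) / temperature  # (B, C)

logits = cvs_logits(torch.randn(2, 512), torch.randn(3, 512), torch.randn(3, 512))
print(torch.sigmoid(logits).shape)  # (2, 3): probability per CVS criterion
```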

Authors:Qinkai Yu, Jianyang Xie, Yitian Zhao, Cheng Chen, Lijun Zhang, Liming Chen, Jun Cheng, Lu Liu, Yalin Zheng, Yanda Meng
Title: Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport
Abstract:
Multimodal ophthalmic imaging-based diagnosis integrates color fundus image with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources often results in real-world clinical scenarios encountering incomplete multimodal data, which significantly compromises diagnostic accuracy. Existing commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1) Imputation methods struggle with accurately reconstructing key lesion features, since OCT lesions are localized, while fundus images vary in style. 2) Distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information, achieving robust cross-modal feature alignment under the missing modality. Specifically, we leverage Optimal Transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of OCT and fundus modalities. Extensive evaluations on three large ophthalmic multimodal datasets demonstrate our model's superior performance under various modality-incomplete scenarios, achieving state-of-the-art performance under both complete-modality and incomplete-modality conditions. Code is available at https://github.com/Qinkaiyu/RIMA
中文总结:本文提出了一种新颖的多模态对齐融合框架,通过最优传输实现特征对齐和采用非对称融合策略,能够稳健处理眼科诊断中OCT或眼底图像缺失的情况,在各种不完整数据场景下均展现出最优性能。
English Summary: This paper introduces a novel multimodal alignment and fusion framework that robustly handles missing OCT or fundus image modalities in ophthalmic diagnostics by employing optimal transport for feature alignment and asymmetric fusion strategies, demonstrating state-of-the-art performance across various incomplete data scenarios.
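The optimal-transport building block can be sketched with plain Sinkhorn iterations between uniform marginals, where the cost might be one minus the cosine similarity between fundus and OCT features; this is a generic entropic-OT solver, not the paper's full multi-scale alignment.

```python
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic-OT transport plan between two uniform marginals via
    Sinkhorn iterations; `cost` could be 1 - cosine similarity between
    fundus and OCT feature sets."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    a = torch.full((n,), 1.0 / n)  # source marginal
    b = torch.full((m,), 1.0 / m)  # target marginal
    u, v = torch.ones(n) / n, torch.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return torch.diag(u) @ K @ torch.diag(v)

plan = sinkhorn(torch.rand(4, 6))
print(round(plan.sum().item(), 4))  # ~1.0: a valid transport plan
```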

Authors:Zonglin Lyu, Chen Chen
Title: TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
Abstract:
Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves a 20% improvement in FID on the most challenging datasets over the recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3$\times$ fewer parameters. Such a parameter reduction results in a 2.3$\times$ speedup. By incorporating optical flow guidance, our method requires 9000$\times$ less training data and achieves over 20$\times$ fewer parameters than video-based diffusion models. Codes and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.

Authors:Qinkai Yu, Wei Zhou, Hantao Liu, Yanyu Xu, Meng Wang, Yitian Zhao, Huazhu Fu, Xujiong Ye, Yalin Zheng, Yanda Meng
Title: Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading
Abstract:
As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, two challenges lead to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2) The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model's effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at https://github.com/Qinkaiyu/AOR-DR.
Chinese: 本研究提出AOR-DR自回归序数回归方法,通过利用糖尿病视网膜病变分级中的临床序数信息和扩散过程,解决了类别不平衡和边界模糊问题,在多个公开数据集上展现出优越性能。
English: This study introduces AOR-DR, an autoregressive ordinal regression method that addresses challenges in diabetic retinopathy grading by leveraging clinical ordinal information and diffusion processes to improve classification accuracy across imbalanced datasets.
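A toy version of the autoregressive ordinal decomposition: grading becomes a sequence of "severity > k" decisions, each conditioned on the image feature and all previous decisions. The diffusion-based conditional modeling is omitted, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class OrdinalAutoregressor(nn.Module):
    """Toy ordinal decomposition: predict 'grade > k' for k = 0..K-2 in
    order, feeding each previous decision back in as a condition for the
    next one."""

    def __init__(self, feat_dim: int = 128, num_grades: int = 5):
        super().__init__()
        self.steps = nn.ModuleList(
            nn.Linear(feat_dim + k, 1) for k in range(num_grades - 1)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        probs = []
        for step in self.steps:
            x = torch.cat([feat] + probs, dim=-1)  # condition on prior steps
            probs.append(torch.sigmoid(step(x)))
        # Predicted grade = expected number of exceeded thresholds.
        return torch.cat(probs, dim=-1).sum(dim=-1)

model = OrdinalAutoregressor()
print(model(torch.randn(2, 128)).shape)  # per-sample continuous grade estimate
```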

Authors:Yingshan Liang, Keyu Fan, Zhicheng Du, Yiran Wang, Qingyang Shi, Xinyu Zhang, Jiasheng Lu, Peiwu Qin
Title: Hear-Your-Click: Interactive Object-Specific Video-to-Audio Generation
Abstract:
Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods relying on global video information struggle with complex scenes and generating audio tailored to specific objects. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework enabling users to generate sounds for specific objects by clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio. Furthermore, we tailor two data augmentation strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), to enhance the model's sensitivity to segmented objects. To measure audio-visual correspondence, we designed a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improves generation performance across various metrics. Project Page: https://github.com/SynapGrid/Hear-Your-Click
中文:Hear-Your-Click框架通过让用户点击视频帧中的特定对象,结合对象感知的视觉特征和定制化数据增强,实现了交互式的视频到音频生成,从而提升了控制的精确性和生成效果。
English: The Hear-Your-Click framework enables interactive video-to-audio generation by allowing users to click on specific objects in a frame, utilizing object-aware visual features and tailored data augmentation to improve precision and performance.

Authors:Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao
Title: Boosting Temporal Sentence Grounding via Causal Inference
Abstract:
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework based on causal intervention and counterfactual reasoning, which utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
中文摘要:时序语句定位(TSG)旨在将视频片段与文本查询相匹配,但现有模型因文本偏见和视频过拟合而产生虚假关联,而本文提出的因果干预与反事实推理框架通过消除这些干扰因素显著提升了模型的泛化能力。
English Summary: Temporal Sentence Grounding (TSG) aims to match video moments with text queries, but current models suffer from spurious correlations caused by textual biases and video overfitting, which this new framework addresses through causal intervention and counterfactual reasoning to improve robustness.
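The counterfactual step reduces to a simple subtraction at inference time: remove the video-only (counterfactual) prediction from the full multimodal one. A hedged sketch, with `alpha` as an assumed trade-off weight:

```python
import torch

def debiased_logits(fused: torch.Tensor, video_only: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Counterfactual debiasing: subtract the video-only (counterfactual)
    moment scores from the full multimodal scores, so moments favored
    purely by video shortcuts lose their advantage."""
    return fused - alpha * video_only

moment_scores = debiased_logits(torch.randn(1, 100), torch.randn(1, 100))
print(moment_scores.argmax(dim=-1))  # debiased best-moment index
```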

Authors:Johannes Künzel, Anna Hilsmann, Peter Eisert
Title: RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
Abstract:
We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.
中文: RIPE是一种基于强化学习的框架,仅需二元场景标签即可弱监督训练关键点提取器,通过超列方法和辅助损失函数在检测与描述任务中实现优异性能,显著简化数据准备并提升鲁棒性。
English: RIPE is a reinforcement learning framework that trains a keypoint extractor with minimal supervision using only binary scene labels, achieving robust performance in detection and description tasks through a hyper-column approach and auxiliary loss.

Authors:Abiao Li, Chenlei Lv, Yuming Fang, Yifan Zuo, Jian Zhang, Guofeng Mei
Title: PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling
Abstract:
Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinate or feature of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose \textbf{\textit{PointGAC}}, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from the proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC
中文摘要:PointGAC提出了一种基于聚类的掩码点云建模方法,通过在线码本引导的师生框架对齐特征分布,避免了过度约束重建细节的问题,从而学习到更具泛化性的特征表示。
English Summary: PointGAC introduces a clustering-based masked point cloud modeling method that aligns feature distributions through an online codebook-guided teacher-student framework, enabling more generalized feature learning without over-constraining reconstruction details.
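One online k-means step of the codebook update can be sketched as nearest-center assignment followed by an EMA move of each used center; this is an illustrative stand-in for the paper's codebook maintenance mechanism, with the momentum value chosen arbitrarily.

```python
import torch

def update_codebook(codebook: torch.Tensor, feats: torch.Tensor,
                    momentum: float = 0.9):
    """One online k-means step: assign features to their nearest codebook
    vector, then move each used center toward the mean of its assigned
    features with an exponential-moving-average update."""
    assign = torch.cdist(feats, codebook).argmin(dim=1)  # (N,) center indices
    for k in assign.unique():
        cluster_mean = feats[assign == k].mean(dim=0)
        codebook[k] = momentum * codebook[k] + (1 - momentum) * cluster_mean
    return codebook, assign

cb, a = update_codebook(torch.randn(16, 32), torch.randn(100, 32))
print(cb.shape, a.shape)  # torch.Size([16, 32]) torch.Size([100])
```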

Authors:Anbang Wang, Marawan Elbatel, Keyuan Liu, Lizhuo Lin, Meng Lan, Yanqi Yang, Xiaomeng Li
Title: Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model
Abstract:
Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model's capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by an 8.18% higher success detection rate at a strict 0.5 mm threshold, a standard widely recognized in dental diagnostics. Code is available at: https://github.com/xmed-lab/GeoSapiens.
Chinese: GeoSapiens是一种新颖的小样本学习框架,通过采用稳健的基线模型和几何损失函数,在CBCT扫描中提升了牙科标志点检测的准确性,以严格的0.5毫米阈值计算,其成功检测率比现有领先方法高出8.18%。
English: GeoSapiens is a novel few-shot learning framework that enhances dental landmark detection on CBCT scans by leveraging a robust baseline model and a geometric loss function, achieving an 8.18% higher success rate at a strict 0.5 mm threshold compared to leading methods.
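A plausible form of a geometric loss, offered as an assumption rather than the paper's definition, penalizes mismatch between the pairwise-distance matrices of predicted and ground-truth landmarks, encouraging the model to preserve relative anatomical structure rather than just absolute positions:

```python
import torch

def geometric_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Illustrative geometric loss: match the pairwise-distance matrices
    of predicted and ground-truth landmark sets, so relative anatomical
    relationships are preserved even when absolute errors remain."""
    d_pred = torch.cdist(pred, pred)  # (B, L, L) predicted inter-landmark distances
    d_gt = torch.cdist(gt, gt)        # (B, L, L) ground-truth distances
    return torch.mean((d_pred - d_gt) ** 2)

print(geometric_loss(torch.rand(2, 8, 3), torch.rand(2, 8, 3)).item())
```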

Authors:Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng
Title: Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
Abstract:
We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows to generate a shadow-independent structure map including facial details while excluding the unwanted shadow boundaries. The structure map is then used as condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and is effective to avoid previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at https://github.com/wanchang-yu/Structure-Guided-Diffusion-for-Portrait-Shadow-Removal.
Chinese: 我们提出的基于扩散的肖像阴影去除方法,通过提取无阴影结构图并采用引导修复和细节恢复扩散模型,有效消除阴影且避免身份篡改、残留阴影等常见问题。
English: Our diffusion-based approach removes portrait shadows by first extracting a shadow-independent structure map and then using guided inpainting and detail restoration diffusion models to achieve high-fidelity results without common artifacts.

Authors:Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu
Title: TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation
Abstract:
Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at https://github.com/lcshhh/teeth_generator.
中文摘要:数字正畸面临训练神经网络所需配对3D牙齿模型数据收集的瓶颈,而TeethGenerator通过两阶段框架生成治疗前后的配对3D牙齿模型,结合真实数据显著提升了牙齿排列性能。
English Summary: Digital orthodontics faces a bottleneck in data collection for training neural networks, which TeethGenerator addresses by using a two-stage framework to synthesize paired 3D teeth models before and after treatment, enhancing tooth alignment performance when combined with real data.

Authors:Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Title: DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation
Abstract:
Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE's effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.
中文:DANCE通过将架构搜索转化为连续演化问题,实现了跨场景自适应的高效网络设计,在显著降低搜索成本的同时,在不同硬件约束下均能保持优异的性能表现。
English: DANCE introduces a continuous evolution approach to neural architecture search, enabling adaptive and efficient network design with reduced costs and robust performance across diverse scenarios and hardware constraints.

Authors:Hahyeon Choi, Junhoo Lee, Nojun Kwak
Title: What's Making That Sound Right Now? Video-centric Audio-Visual Localization
Abstract:
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

Authors:Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu
Title: Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Abstract:
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such models into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt to varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.
中文:提出的SMoEStereo框架通过将视觉基础模型与自适应MoE-LoRA模块及轻量决策网络相结合,无需针对特定数据集调整即可实现最先进的跨域性能,显著提升了立体匹配的鲁棒性。
English: The proposed SMoEStereo framework enhances stereo matching robustness by integrating Vision Foundation Models with adaptive MoE-LoRA modules and a lightweight decision network, achieving state-of-the-art cross-domain performance without dataset-specific tuning.
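To make the MoE-LoRA idea concrete, here is a minimal, self-contained sketch of a LoRA-style mixture-of-experts residual layer with a softmax router. The expert ranks, router design, and dimensions are illustrative assumptions, not the released SMoE-Stereo implementation (which also gates whole modules with a separate decision network).

```python
# Minimal sketch of a LoRA-style mixture-of-experts residual layer with a
# softmax router (expert ranks, router, and dims are illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRA(nn.Module):
    def __init__(self, dim, ranks=(2, 4, 8)):
        super().__init__()
        # One low-rank expert per candidate rank; the router picks among them.
        self.down = nn.ModuleList(nn.Linear(dim, r, bias=False) for r in ranks)
        self.up = nn.ModuleList(nn.Linear(r, dim, bias=False) for r in ranks)
        self.router = nn.Linear(dim, len(ranks))
        for u in self.up:                      # LoRA init: updates start at 0
            nn.init.zeros_(u.weight)

    def forward(self, x):                      # x: (batch, tokens, dim)
        gate = F.softmax(self.router(x.mean(dim=1)), dim=-1)   # (batch, E)
        out = torch.zeros_like(x)
        for i, (d, u) in enumerate(zip(self.down, self.up)):
            out = out + gate[:, i, None, None] * u(d(x))
        return x + out                         # residual update on frozen features

tokens = torch.randn(2, 196, 768)              # e.g. frozen ViT tokens
print(MoELoRA(768)(tokens).shape)              # torch.Size([2, 196, 768])
```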

Authors:Shengli Zhou, Yang Liu, Feng Zheng
Title: Learn 3D VQA Better with Active Selection and Reannotation
Abstract:
3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments show better model performance and a substantial reduction in training costs, halving the training cost needed to achieve relatively high accuracy. The code is available at https://github.com/fz-zsl/AQuA.
Chinese: 该研究提出了一种多轮交互式主动学习策略,通过基于语义不确定性选择数据并请求重新标注来解决误导性标签问题,显著提升了3D视觉问答的模型性能并减半了训练成本。
English: The study introduces a multi-turn interactive active learning strategy for 3D Visual Question Answering that selects data based on semantic uncertainty and requests reannotation to address misleading labels, significantly improving model performance and halving training costs.
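The variance-based uncertainty metric can be pictured as the spread of candidate-answer embeddings under the model's predictive distribution, so that semantically close answers ("sofa" vs. "couch") contribute little uncertainty. The sketch below is a hedged illustration of that idea; the paper's exact formulation may differ.

```python
# Hedged sketch of a semantic-aware, variance-based uncertainty score:
# trace of the covariance of answer embeddings under the model's predictive
# distribution (low when probable answers are semantically close).
import numpy as np

def semantic_variance(probs, embeddings):
    """probs: (K,) predictive probabilities over K candidate answers.
    embeddings: (K, D) semantic vectors for those answers."""
    probs = np.asarray(probs, dtype=float)
    mean = probs @ embeddings                        # probability-weighted mean
    diffs = embeddings - mean                        # (K, D)
    return float(np.sum(probs * np.sum(diffs**2, axis=1)))

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # "sofa" ~ "couch" != "window"
print(semantic_variance([0.5, 0.5, 0.0], emb))  # small: probable answers agree
print(semantic_variance([0.5, 0.0, 0.5], emb))  # large: probable answers disagree
```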

Authors:Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo
Title: QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation
Abstract:
Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific ΔR matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between ΔR matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models. The project page is available at: https://luna-ai-lab.github.io/QR-LoRA/.
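A compact sketch of the structured update described above: factor the frozen weight once as W = QR, keep Q and R fixed, and train only an additive ΔR; merging adapters then amounts to summing their deltas. Dimensions and the toy objective are assumptions for illustration, not the released code.

```python
# Illustrative sketch of QR-LoRA's structured update.
import torch

torch.manual_seed(0)
W = torch.randn(768, 768)                  # frozen pretrained weight
Q, R = torch.linalg.qr(W)                  # orthogonal Q, upper-triangular R

delta_R = torch.zeros_like(R, requires_grad=True)   # the only trainable part

def adapted_forward(x):
    # Effective weight Q @ (R + delta_R); Q's orthogonality is what limits
    # interference between delta_R updates learned for different attributes.
    return x @ (Q @ (R + delta_R)).T

x = torch.randn(4, 768)
adapted_forward(x).pow(2).mean().backward()         # toy objective
print(delta_R.grad.shape)                           # torch.Size([768, 768])

# Merging a content adapter and a style adapter then reduces to summing
# their deltas: W_merged = Q @ (R + delta_R_content + delta_R_style).
```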

Authors:Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng, Ruixuan Wang
Title: FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
Abstract:
Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining the same number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.
中文: 本研究提出了一种强制提示学习框架,通过利用多样化的类别描述来增强分布外检测,无需外部数据集即可实现卓越性能。
English: This study introduces a forced prompt learning framework that enhances out-of-distribution detection by leveraging in-distribution knowledge through diversified class descriptions, achieving superior performance without external datasets.
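As a rough illustration of the forced-prompt idea, the sketch below adds, on top of the usual prompt-tuning cross-entropy, a term that pushes ID image features toward a learnable forced prompt, scaled by a forced coefficient. The loss form, margin, and shapes are assumptions for illustration, not the released FA objective.

```python
# Hedged sketch: standard CLIP-style prompt-tuning CE plus a "forced" term
# that raises similarity between ID images and a learnable forced prompt.
import torch
import torch.nn.functional as F

B, C, D = 8, 10, 512
img = F.normalize(torch.randn(B, D), dim=-1)          # frozen image features
cls_txt = F.normalize(torch.randn(C, D), dim=-1)      # class-label text feats
forced = torch.nn.Parameter(torch.randn(D))           # learnable forced prompt
labels = torch.randint(0, C, (B,))
lambda_f = 2.0                                        # forced coefficient (assumed)

ce = F.cross_entropy(img @ cls_txt.T / 0.07, labels)  # usual prompt-tuning CE
forced_sim = img @ F.normalize(forced, dim=0)         # (B,) image-prompt similarity
# Hinge pushing ID images to align notably with the forced prompt:
force_loss = F.relu(lambda_f * 0.2 - forced_sim).mean()
loss = ce + force_loss
loss.backward()                                       # only `forced` gets gradients
print(float(loss))
```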

Authors:Yikang Zhao, Feng Gao, Xuepeng Jin, Junyu Dong, Qian Du
Title: Dynamic Frequency Feature Fusion Network for Multi-Source Remote Sensing Data Classification
Abstract:
Multi-source data classification is a critical yet challenging task for remote sensing image interpretation. Existing methods lack adaptability to diverse land cover types when modeling frequency domain features. To this end, we propose a Dynamic Frequency Feature Fusion Network (DFFNet) for hyperspectral image (HSI) and Synthetic Aperture Radar (SAR) / Light Detection and Ranging (LiDAR) data joint classification. Specifically, we design a dynamic filter block to dynamically learn the filter kernels in the frequency domain by aggregating the input features. The frequency contextual knowledge is injected into frequency filter kernels. Additionally, we propose spectral-spatial adaptive fusion block for cross-modal feature fusion. It enhances the spectral and spatial attention weight interactions via channel shuffle operation, thereby providing comprehensive cross-modal feature fusion. Experiments on two benchmark datasets show that our DFFNet outperforms state-of-the-art methods in multi-source data classification. The codes will be made publicly available at https://github.com/oucailab/DFFNet.
Chinese: 提出的动态频率特征融合网络(DFFNet)通过动态学习频域滤波器和融合跨模态特征,显著提升了多源遥感数据的分类性能,在基准数据集上超越了现有先进方法。
English: The proposed Dynamic Frequency Feature Fusion Network (DFFNet) enhances multi-source remote sensing classification by dynamically learning frequency domain filters and integrating cross-modal features, demonstrating superior performance over existing methods.
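The dynamic filter block can be pictured as predicting a per-sample frequency-domain mask from aggregated input features and applying it to the 2D Fourier spectrum. A minimal torch.fft sketch follows; the aggregation and filter parameterization are illustrative assumptions, not DFFNet's exact design.

```python
# Sketch of a dynamic frequency filter: the mask applied in the Fourier
# domain is predicted from the input features themselves.
import torch
import torch.nn as nn

class DynamicFreqFilter(nn.Module):
    def __init__(self, channels, h, w):
        super().__init__()
        self.h, self.w_freq = h, w // 2 + 1          # rfft2 output width
        self.to_filter = nn.Linear(channels, channels * self.h * self.w_freq)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        ctx = x.mean(dim=(2, 3))                     # aggregate input features
        filt = self.to_filter(ctx).view(B, C, self.h, self.w_freq)
        spec = torch.fft.rfft2(x, norm="ortho")      # complex spectrum
        spec = spec * torch.sigmoid(filt)            # input-conditioned filtering
        return torch.fft.irfft2(spec, s=(H, W), norm="ortho")

x = torch.randn(2, 16, 32, 32)
print(DynamicFreqFilter(16, 32, 32)(x).shape)        # torch.Size([2, 16, 32, 32])
```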

Authors:You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao
Title: ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition
Abstract:
Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder their progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called ViTaL that contains Visual, Tabular and Linguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumors, we propose ViTaL-Net, based on the Triplet Hierarchical Offset Attention Mechanism (THOAM), to minimize the loss incurred during feature fusion of multi-modal data. This mechanism effectively enhances the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90% on the two most common pathological types of ovarian tumor and an overall performance of 85%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet.
中文: 为解决卵巢肿瘤检测公共数据集不足的问题,本研究推出了ViTaL数据集,整合了496名患者的视觉、表格和语言数据,并提出基于三重分层偏移注意力机制的ViTaL-Net模型,在多病理分类中实现了高精度。
English: To address the limited public datasets for ovarian tumor detection, this study introduces the ViTaL dataset, which integrates visual, tabular, and linguistic data from 496 patients, and proposes ViTaL-Net, a model using a Triplet Hierarchical Offset Attention Mechanism to achieve high accuracy in multi-pathology classification.

Authors:Zhipeng Li, Kegang Wang, Hanguang Xiao, Xingyue Liu, Feizhong Zhou, Jiaxin Jiang, Tianqi Liu
Title: Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis
Abstract:
Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at https://github.com/dalaoplan/Happp-rPPG-Toolkit.
Chinese: 远程光电容积描记术(rPPG)是一种有前景的非接触式生理监测方法,但由于缺乏专门的数据集,其在动态夜间条件下的有效性尚不明确,因此我们发布了DLCN数据集以评估算法的鲁棒性。
English: Remote photoplethysmography (rPPG) is a promising non-contact method for physiological monitoring, but its effectiveness in dynamic nighttime conditions remains unclear due to a lack of specialized datasets, prompting the release of the DLCN dataset to evaluate algorithm robustness.

Authors:Roy Uziel, Irit Chelly, Oren Freifeld, Ari Pakman
Title: Clustering via Self-Supervised Diffusion
Abstract:
Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher-student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions. Our code is available at https://github.com/BGU-CS-VIL/CLUDI.
中文: CLUDI是一种自监督聚类框架,通过教师-学生范式结合扩散模型和视觉Transformer特征,在无监督分类中实现了最先进的性能,能够揭示复杂的数据结构。
English: CLUDI is a self-supervised clustering framework that leverages diffusion models and Vision Transformer features through a teacher-student paradigm, achieving state-of-the-art performance in unsupervised classification by uncovering complex data structures.

Authors:Mohammadreza Sharifi, Ahad Harati
Title: Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need
Abstract:
Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model. Based on a pre-scheduled filtering ratio, GSTDS effectively reduces the number of data points processed per batch. The proposed method ensures an efficient selection of the most informative data points for training while avoiding redundant or less beneficial computations. Data points within each batch are retained based on spectral analysis: a Fiedler vector-based scoring mechanism removes the filtered portion of the batch, lightening the resource requirements of learning. The proposed data selection approach not only streamlines the training process but also promotes improved generalization and accuracy. Extensive experiments on standard image classification benchmarks, including CIFAR-10, Oxford-IIIT Pet, and Oxford-Flowers, demonstrate that GSTDS outperforms standard training scenarios and JEST, a recent state-of-the-art data curation method, on several key factors. GSTDS achieves notable reductions in computational requirements, up to four times, without compromising performance, and, in contrast to other methodologies, attains considerably higher accuracy under limited computational budgets. These promising results underscore the potential of spectral-based data selection as a scalable solution for resource-efficient deep learning and motivate further exploration into adaptive data curation strategies. You can find the code at https://github.com/rezasharifi82/GSTDS.
中文: GSTDS算法通过谱分析动态选择神经网络训练中最具信息量的数据点,在多个基准测试中将计算需求显著降低达四倍,同时提高了准确性和泛化能力。
English: The GSTDS algorithm dynamically selects the most informative data points for neural network training using spectral analysis, significantly reducing computational requirements by up to four times while improving accuracy and generalization across multiple benchmarks.
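A minimal sketch of Fiedler-vector scoring over one batch: build an affinity graph from reference-model embeddings, take the second eigenvector of the graph Laplacian, and keep the highest-scoring points. The affinity kernel and keep-rule are assumptions; GSTDS schedules the filtering ratio over training.

```python
# Sketch of spectral (Fiedler vector) batch filtering.
import numpy as np

def fiedler_scores(features):
    """features: (N, D) reference-model embeddings for one batch."""
    sim = features @ features.T
    w = np.exp(sim / np.sqrt(features.shape[1]))     # dense affinity matrix
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w                 # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)
    return np.abs(eigvecs[:, 1])                     # Fiedler vector as scores

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 128))
scores = fiedler_scores(feats)
keep = np.argsort(scores)[-32:]                      # e.g. filtering ratio 0.5
print(keep.shape)                                    # (32,)
```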

Authors:Yuan Zhong, Jingxiang Sun, Liang An, Yebin Liu
Title: MoReMouse: Monocular Reconstruction of Laboratory Mouse
Abstract:
Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders the progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice, by rendering our self-designed realistic Gaussian mouse avatar. Second, MoReMouse adopts a transformer-based feedforward architecture with triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on mouse surface, which serve as strong semantic priors to improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness. Video results are available at https://zyyw-eric.github.io/MoreMouse-webpage/.

Authors:Xinbo Wang, Wenju Xu, Qing Zhang, Wei-Shi Zheng
Title: Domain Generalizable Portrait Style Transfer
Abstract:
This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at https://github.com/wangxb29/DGPST.
中文: 本文提出一种人像风格迁移方法,通过建立密集语义对应和使用AdaIN小波变换,在多个面部区域实现高质量语义对齐的风格化,同时有效平衡内容保持与风格化效果。
English: This paper introduces a portrait style transfer method that achieves high-quality, semantically aligned stylization across multiple facial regions by establishing dense semantic correspondence and using an AdaIN-Wavelet transform to balance content preservation with style application.
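The AdaIN-Wavelet transform can be approximated in a few lines: decompose both images with a 2D wavelet, take the low-frequency band from the warped reference (statistics-matched AdaIN-style) and the high-frequency bands from the input, then invert. This pixel-space sketch with PyWavelets is illustrative only; the paper performs the blend on diffusion latents.

```python
# Sketch of an AdaIN-Wavelet style low/high frequency blend.
import numpy as np
import pywt  # PyWavelets

def adain(x, ref):
    # Match mean/std of x to the reference (whole-map stats, for brevity).
    return (x - x.mean()) / (x.std() + 1e-5) * ref.std() + ref.mean()

def adain_wavelet_blend(content, reference):
    cA_c, (cH_c, cV_c, cD_c) = pywt.dwt2(content, "haar")
    cA_r, _ = pywt.dwt2(reference, "haar")
    cA = adain(cA_r, cA_c)            # low frequencies: from the reference
    return pywt.idwt2((cA, (cH_c, cV_c, cD_c)), "haar")  # highs: from input

content = np.random.rand(64, 64)
reference = np.random.rand(64, 64)
print(adain_wavelet_blend(content, reference).shape)      # (64, 64)
```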

Authors:Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, René Vidal
Title: Voyaging into Perpetual Dynamic Scenes from a Single View
Abstract:
The problem of generating a perpetual dynamic scene from a single view is an important problem with widespread applications in augmented reality, virtual reality, and robotics. However, since dynamic scenes regularly change over time, a key challenge is to ensure that different generated views are consistent with the underlying 3D motions. Prior work learns such consistency by training on multiple views, but the generated scene regions often interpolate between training views and fail to generate perpetual views. To address this issue, we propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting problem with new dynamic content. As 2D outpainting models struggle at generating 3D consistent motions from a single 2D view, we enrich 2D pixels with information from their 3D rays that facilitates learning of 3D motion consistency. More specifically, we first map the single-view video input to a dynamic point cloud using the estimated video depths. We then render a partial video of the point cloud from a novel view and outpaint the missing regions using ray information (e.g., the distance from a ray to the point cloud) to generate 3D consistent motions. Next, we use the outpainted video to update the point cloud, which is used for outpainting the scene from future novel views. Moreover, we can control the generated content with the input text prompt. Experiments show that our model can generate perpetual scenes with consistent motions along fly-through cameras. Project page: https://tianfr.github.io/DynamicVoyager.

Authors:Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo, Ceyu Xu, Hao Frank Yang
Title: Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge
Abstract:
This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly demonstrate the advances EMC2 delivers as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs. The official implementation is available at https://github.com/LinshenLiu622/EMC2.
中文: EMC2是一种面向自动驾驶车辆的边缘优化计算系统,通过场景感知专家混合架构融合激光雷达与相机数据,结合软硬件协同优化,实现了高精度三维物体检测与低延迟性能。
English: EMC2 is an edge-optimized computing system for autonomous vehicles that achieves high-accuracy 3D object detection with low latency by fusing LiDAR and camera data through a scenario-aware mixture of experts architecture and joint hardware-software optimizations.
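Conceptually, the scenario-aware routing step reduces to dispatching features to an expert chosen from object visibility and distance. The toy rule below uses made-up thresholds and expert names purely to illustrate the control flow; EMC2's actual router is learned and optimized for edge hardware.

```python
# Toy illustration of visibility/distance-based expert routing
# (thresholds and expert names are hypothetical).
def route(visibility, distance_m):
    if visibility < 0.3:
        return "lidar_heavy_expert"     # occluded: trust geometry
    if distance_m > 40.0:
        return "long_range_expert"      # far: sparse points, lean on camera
    return "fusion_expert"              # nearby and visible: full fusion

for vis, dist in [(0.1, 10.0), (0.9, 60.0), (0.8, 15.0)]:
    print(route(vis, dist))
```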

Authors:Ziming Hong, Runnan Chen, Zengmao Wang, Bo Han, Bo Du, Tongliang Liu
Title: When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution Trap is All You Need
Abstract:
Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD by diverting the generator's attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than on ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers. Code is released at https://github.com/tmllab/2025_ICML_ATEsc.
中文: 本研究提出对抗性陷阱逃逸方法,通过识别和过滤分布外合成样本,有效应对不可迁移学习教师,提升无数据知识蒸馏的鲁棒性和知识传递效果。
English: This study introduces Adversarial Trap Escaping (ATEsc) to enhance data-free knowledge distillation by identifying and filtering out-of-distribution synthetic samples, effectively countering non-transferable learning teachers and improving knowledge transfer robustness.
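The splitting rule can be sketched with a one-step FGSM probe: synthetic samples whose teacher prediction flips under a small adversarial perturbation are treated as ID-like (fragile), the rest as OOD-like (robust). The attack strength and single-step probe are simplifications of the paper's procedure.

```python
# Sketch of robustness-based splitting of generator samples.
import torch
import torch.nn.functional as F

def split_by_robustness(teacher, x, eps=4 / 255):
    x = x.clone().requires_grad_(True)
    pred = teacher(x).argmax(dim=1)                    # teacher's own labels
    loss = F.cross_entropy(teacher(x), pred)
    grad, = torch.autograd.grad(loss, x)
    x_adv = (x + eps * grad.sign()).clamp(0, 1)        # one-step FGSM probe
    flipped = teacher(x_adv).argmax(dim=1) != pred     # prediction changed?
    return x[flipped].detach(), x[~flipped].detach()   # (ID-like, OOD-like)

teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
fake = torch.rand(16, 3, 32, 32)                       # generator output
id_like, ood_like = split_by_robustness(teacher, fake)
print(id_like.shape[0] + ood_like.shape[0])            # 16
```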

Authors:Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Title: PromptSR: Cascade Prompting for Lightweight Image Super-Resolution
Abstract:
Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at https://github.com/wenyang001/PromptSR.
Chinese: PromptSR提出了一种级联提示块,通过全局锚点提示和局部提示层在保持低计算成本的同时有效扩大感受野,在定量、定性和复杂度评估上均优于现有轻量级图像超分辨率方法。
English: PromptSR introduces a cascade prompting block that uses global anchor prompts and local prompting layers to expand the receptive field efficiently while maintaining low computational costs, outperforming existing lightweight image super-resolution methods.
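A minimal sketch of the anchor-prompting pattern: downscale the token map into a few anchors, let the anchors attend over all tokens (cheap, since there are few queries), then let the full-resolution tokens query the anchors for long-range context. Head counts and dimensions are illustrative assumptions, not PromptSR's configuration.

```python
# Sketch of cross-scale anchor prompting for long-range context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorPrompting(nn.Module):
    def __init__(self, dim=64, anchor_hw=4):
        super().__init__()
        self.anchor_hw = anchor_hw
        self.attn1 = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        a = F.adaptive_avg_pool2d(x, self.anchor_hw)       # low-dim anchors
        anchors = a.flatten(2).transpose(1, 2)             # (B, 16, C)
        # Anchors gather global context cheaply (16 queries, not H*W):
        anchors, _ = self.attn1(anchors, tokens, tokens)
        # Tokens then query the anchors for long-range information:
        prompted, _ = self.attn2(tokens, anchors, anchors)
        return (tokens + prompted).transpose(1, 2).reshape(B, C, H, W)

print(AnchorPrompting()(torch.randn(2, 64, 24, 24)).shape)
```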

Authors:Xiaohan Zhang, Tavis Shore, Chen Chen, Oscar Mendez, Simon Hadfield, Safwan Wshah
Title: VICI: VLM-Instructed Cross-view Image-localisation
Abstract:
In this paper, we present a high-performing solution to the UAVM 2025 Challenge, which focuses on matching narrow FOV street-level images to corresponding satellite imagery using the University-1652 dataset. As panoramic Cross-View Geo-Localisation nears peak performance, it becomes increasingly important to explore more practical problem formulations. Real-world scenarios rarely offer panoramic street-level queries; instead, queries typically consist of limited-FOV images captured with unknown camera parameters. Our work prioritises discovering the highest achievable performance under these constraints, pushing the limits of existing architectures. Our method begins by retrieving candidate satellite image embeddings for a given query, followed by a re-ranking stage that selectively enhances retrieval accuracy within the top candidates. This two-stage approach enables more precise matching, even under the significant viewpoint and scale variations inherent in the task. Through experimentation, we demonstrate that our approach achieves competitive results, specifically attaining R@1 and R@10 retrieval rates of \topone\% and \topten\%, respectively. This underscores the potential of optimised retrieval and re-ranking strategies in advancing practical geo-localisation performance. Code is available at https://github.com/tavisshore/VICI.
中文: 本文提出了一种两阶段方法,通过优化的嵌入检索和选择性重排序,实现了窄视场街景图像与卫星图像的精确匹配,获得了具有竞争力的检索率。
English: This paper introduces a two-stage method for matching narrow field-of-view street images to satellite imagery, achieving competitive retrieval rates through optimized embedding retrieval and selective re-ranking.

Authors:Jianwei Tang, Hong Yang, Tengyue Chen, Jian-Fang Hu
Title: Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic
Abstract:
Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, action characteristics are difficult to learn because of the similarity of some actions. These issues cause the predicted results to be unreasonable and inconsistent. As a result, we propose two memory banks, the Soft-transition Action Bank (STAB) and Action Characteristic Bank (ACB), to tackle the problems above. The STAB stores the action transition information. It is equipped with a novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristics, providing more prior information for predicting certain actions. To better fuse the features retrieved from the two banks, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms the previous state-of-the-art. The demo and code are available at https://hyqlat.github.io/STABACB.github.io/.

Authors:Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Title: Consistent and Invariant Generalization Learning for Short-video Misinformation Detection
Abstract:
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in videos together with their accompanying audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we offer deep insights into the characteristics of different domains: (1) detection in different domains may rely mainly on different modalities (i.e., mainly on video or on audio), so to enhance domain generalization it is crucial to achieve optimal model performance on all modalities simultaneously; (2) for domains involving cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary, yet domain biases located in each modality (especially in each frame of video) accumulate in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) cross-modal feature interpolation maps multiple modalities into a shared space, with interpolation distillation synchronizing multi-modal learning; (2) a diffusion model adds noise to retain core multi-modal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.
中文: 该摘要提出了DOCTOR模型,通过跨模态一致性学习和基于扩散的去噪方法,在共享空间中同步多模态特征并强化域不变特征,有效解决了短视频虚假信息检测中因领域差异导致的泛化性能下降问题。
English: This abstract introduces DOCTOR, a novel domain generalization model for short-video misinformation detection that employs cross-modal consistency learning and diffusion-based denoising to overcome performance degradation across unseen domains by synchronizing multi-modal features and enhancing domain-invariant representations.
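Cross-modal feature interpolation can be pictured as mixup-style blending of video and audio features already projected into a shared space, so that a prediction on the blend can be distilled to stay consistent across modalities. The blend below is a hedged illustration, not DOCTOR's exact module.

```python
# Hedged sketch of cross-modal feature interpolation (mixup-style blend).
import torch
import torch.nn.functional as F

def interpolate_modalities(video_feat, audio_feat, alpha=0.4):
    """video_feat, audio_feat: (B, D) features in a shared space."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * video_feat + (1 - lam) * audio_feat, lam

video = F.normalize(torch.randn(8, 256), dim=-1)
audio = F.normalize(torch.randn(8, 256), dim=-1)
mixed, lam = interpolate_modalities(video, audio)
print(mixed.shape, float(lam))      # torch.Size([8, 256]), lambda in (0, 1)
```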

Authors:Jianwei Tang, Jiangxin Sun, Xiaotong Lin, Lifang Zhang, Wei-Shi Zheng, Jian-Fang Hu
Title: Temporal Continual Learning with Prior Compensation for Human Motion Prediction
Abstract:
Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at https://github.com/hyqlat/TCL.
Chinese: 本文提出了一种时序持续学习(TCL)框架,通过引入先验补偿因子来保留历史信息并优化训练目标,有效解决了人体运动预测中的现有局限,在多个基准数据集上验证了其有效性。
English: This paper introduces a Temporal Continual Learning (TCL) framework with a Prior Compensation Factor to address limitations in Human Motion Prediction by preserving prior information and optimizing training objectives, demonstrating effectiveness across multiple datasets.

Authors:Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano
Title: T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images
Abstract:
One of the key impediments for developing and assessing robust medical imaging algorithms is limited access to large-scale datasets with suitable annotations. Synthetic data generated with plausible physical and biological constraints may address some of these data limitations. We propose the use of physics simulations to generate synthetic images with pixel-level segmentation annotations, which are notoriously difficult to obtain. Specifically, we apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Our initial experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT. Our data and code are publicly available at https://github.com/DIDSR/tsynth-release.
中文摘要:开发稳健医学影像算法的主要障碍在于缺乏大规模标注数据集,而基于物理模拟生成的合成数据可缓解此问题;T-SYNTH乳腺影像数据集的实验表明,其能有效增强真实数据在检测任务中的表现。
English Summary: The development of robust medical imaging algorithms is hindered by limited annotated datasets, which can be addressed through physics-based synthetic data generation, as demonstrated by the T-SYNTH dataset for breast imaging that shows potential for augmenting real data in detection tasks.

Authors:Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao
Title: PresentAgent: Multimodal Agent for Presentation Video Generation
Abstract:
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
中文:PresentAgent是一种将长篇文档转化为同步视觉与语音叙述视频的多模态系统,其通过新型评估框架PresentEval验证,在各项关键指标上均展现出接近人类水平的演示质量。
English: PresentAgent is a multimodal system that converts long documents into synchronized narrated videos with human-like presentation quality, evaluated by the novel PresentEval framework which confirms its near-human performance across key metrics.

Authors:Seungjin Jung, Kanghee Lee, Yonghyun Jeong, Haeun Noh, Jungmin Lee, Jongwon Choi
Title: Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing
Abstract:
Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
中文摘要:该研究提出的DGFAS框架通过特征正交分解和组间缩放风险最小化技术,联合校准权重与偏置项,有效解决了跨域人脸活体检测中的偏置失准问题,在多个基准数据集上实现了最优性能。
English Summary: The proposed DGFAS framework addresses bias misalignment in face anti-spoofing systems by jointly aligning weights and biases through Feature Orthogonal Decomposition and Group-wise Scaling Risk Minimization, achieving state-of-the-art performance across domains.
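Group-wise Scaling Risk Minimization can be sketched as upweighting the per-domain losses that lag behind, pushing the classifier's bias terms to align across source domains. The scaling rule below (normalized loss magnitudes) is an assumption for illustration; the paper's balancing scheme may differ.

```python
# Sketch of group-wise loss scaling across source domains.
import torch

def group_wise_scaled_loss(per_domain_losses):
    """per_domain_losses: list of scalar losses, one per source domain."""
    losses = torch.stack(per_domain_losses)
    weights = (losses / losses.sum()).detach()   # harder domains weigh more
    return (weights * losses).sum()

domain_losses = [torch.tensor(0.2, requires_grad=True),
                 torch.tensor(0.9, requires_grad=True),
                 torch.tensor(0.5, requires_grad=True)]
total = group_wise_scaled_loss(domain_losses)
total.backward()
print(float(total))                              # lagging domains dominate
```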

Authors:Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang
Title: NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models
Abstract:
Bird's Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, though pivotal for real-world applications, underperforms due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn-yu/NRSeg.
中文: 本文提出NRSeg框架,通过一致性度量和鲁棒学习模块利用合成数据增强鸟瞰图语义分割,在无监督和半监督任务中实现了最先进的性能提升。
English: This paper introduces NRSeg, a noise-resilient framework that enhances BEV semantic segmentation by leveraging synthetic data through a consistency metric and robust learning modules, achieving state-of-the-art performance improvements in unsupervised and semi-supervised tasks.
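At its core, the Perspective-Geometry Consistency Metric is an alignment score between the road mask observed in the generated perspective frame and the BEV-label mask projected into that view. A hedged IoU-style sketch follows; the projection is assumed given, and the paper's exact alignment measure may differ.

```python
# Hedged, IoU-style sketch of a PGCM-like alignment score.
import numpy as np

def pgcm(perspective_mask, projected_bev_mask):
    inter = np.logical_and(perspective_mask, projected_bev_mask).sum()
    union = np.logical_or(perspective_mask, projected_bev_mask).sum()
    return inter / union if union else 1.0   # 1.0 = perfect guidance capability

gen = np.zeros((8, 8), bool); gen[4:, :] = True    # road in generated frame
proj = np.zeros((8, 8), bool); proj[3:, :] = True  # projected BEV road labels
print(round(pgcm(gen, proj), 3))                   # 0.8
```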

Authors:Aleksandr Gushchin, Maksim Smirnov, Dmitriy Vatolin, Anastasia Antsiferova
Title: LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts
Abstract:
We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated Compressed Video Quality Assessment) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants; 1.8M pairwise comparisons and 1.5k MOS ratings are fused into a single quality scale, and part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset's challenge and utility. The open part and the results of LEHA-CVQAD are available at https://aleksandrgushchin.github.io/lcvqad/
中文:LEHA-CVQAD数据集包含6,240个压缩视频片段及新型RDAE评估指标,通过测试发现现有视频质量评估方法存在明显不足,为编解码器参数优化提供了重要数据支撑。
English: The LEHA-CVQAD dataset introduces 6,240 compressed video clips with extensive annotations and a novel RDAE metric to evaluate video quality models, revealing limitations in current methods while providing resources for codec optimization.
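In spirit, RDAE asks how often a VQA model inverts the subjective bitrate-quality ordering across the encodes of one source video. The sketch below counts pairwise order violations; the paper's exact RDAE definition is not reproduced here and may weight pairs differently.

```python
# Hedged sketch of a rate-distortion ordering check in the spirit of RDAE.
import itertools

def ordering_error(mos, pred):
    """mos, pred: quality scores per codec-preset variant of one source."""
    pairs = list(itertools.combinations(range(len(mos)), 2))
    bad = sum((mos[i] - mos[j]) * (pred[i] - pred[j]) < 0 for i, j in pairs)
    return bad / len(pairs)

mos = [2.1, 3.0, 3.8, 4.5]           # subjective quality, low -> high bitrate
pred = [2.0, 3.4, 3.2, 4.6]          # VQA model flips one adjacent pair
print(ordering_error(mos, pred))     # 0.1666...: 1 of 6 pairs inverted
```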

Authors:Tianyao He, Runqi Wang, Yang Chen, Dejia Song, Nemo Chen, Xu Tang, Yao Hu
Title: Flux-Sculptor: Text-Driven Rich-Attribute Portrait Editing through Decomposed Spatial Flow Control
Abstract:
Text-driven portrait editing holds significant potential for various applications but also presents considerable challenges. An ideal text-driven portrait editing approach should achieve precise localization and appropriate content modification, yet existing methods struggle to balance reconstruction fidelity and editing flexibility. To address this issue, we propose Flux-Sculptor, a flux-based framework designed for precise text-driven portrait editing. Our framework introduces a Prompt-Aligned Spatial Locator (PASL) to accurately identify relevant editing regions and a Structure-to-Detail Edit Control (S2D-EC) strategy to spatially guide the denoising process through sequential mask-guided fusion of latent representations and attention values. Extensive experiments demonstrate that Flux-Sculptor surpasses existing methods in rich-attribute editing and facial information preservation, making it a strong candidate for practical portrait editing applications. Project page is available at https://flux-sculptor.github.io/.
中文: Flux-Sculptor是一种基于通量的创新框架,通过提示对齐空间定位器和结构到细节编辑控制策略,实现了精准的文本驱动人像编辑,在编辑灵活性和面部信息保留方面表现卓越。
English: Flux-Sculptor is a novel framework that achieves precise text-driven portrait editing through a Prompt-Aligned Spatial Locator and a Structure-to-Detail Edit Control strategy, excelling in both editing flexibility and facial preservation.

Authors:Ze Li, Feng Zhang, Xiatian Zhu, Meng Zhang, Yanghong Zhou, P. Y. Mok
Title: Robust Low-light Scene Restoration via Illumination Transition
Abstract:
Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illuminance transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at https://pegasus2004.github.io/RoSe.

Authors:Kai Ye, Tianyi Chen, Zhen Wang
Title: Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study
Abstract:
With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation-based protection methods (AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC) across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: https://github.com/vkeilo/DiffAdvPerturbationBench.
中文摘要:本研究全面评估了八种基于扰动的扩散模型保护方法,以应对隐私泄露和内容滥用问题,为在肖像和艺术作品领域选择有效技术提供了实用指导。
English Summary: This study comprehensively evaluates eight perturbation-based protection methods for diffusion models to address privacy and misuse concerns, providing practical guidance for selecting effective techniques across portrait and artwork domains.

Authors:Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Ulas Bagci, Min Xu, Trung-Nghia Le, Huy-Hieu Pham
Title: Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation
Abstract:
Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at https://github.com/hieuphamha19/CSDS.
中文:提出的颜色-结构双学生(CSDS)框架通过两个专门的学生网络分别学习染色和结构特征,解决了组织病理学中腺体分割的难题,在有限标注数据下取得了最优性能。
English: The proposed Color-Structure Dual-Student (CSDS) framework addresses gland segmentation challenges in histopathology by using two specialized student networks to disentangle stain and structural features, achieving state-of-the-art results with limited labeled data.
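The shared teacher's EMA update can be sketched in a few lines; averaging the two students before the EMA step is an assumption made for illustration, as CSDS may combine the stain and structure students differently (e.g., via its uncertainty-aware weighting).

```python
# Minimal sketch of an EMA teacher tracking two specialized students.
import copy
import torch

def ema_update(teacher, students, decay=0.99):
    with torch.no_grad():
        for name, t_param in teacher.named_parameters():
            s_avg = sum(dict(s.named_parameters())[name].data
                        for s in students) / len(students)
            t_param.data.mul_(decay).add_(s_avg, alpha=1 - decay)

student_stain = torch.nn.Linear(8, 2)    # trained on stain-augmented inputs
student_struct = torch.nn.Linear(8, 2)   # trained on structure-augmented inputs
teacher = copy.deepcopy(student_stain)
ema_update(teacher, [student_stain, student_struct])
print(teacher.weight.shape)              # teacher now tracks both students
```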

Authors:Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu
Title: Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search
Abstract:
Multimodal representation is faithful and highly effective in describing real-world data samples' characteristics by describing their complementary information. However, the collected data often exhibits incomplete and misaligned characteristics due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the issue of filling missing data in scenarios where multiview data are both imbalanced and misaligned. Instead, it relies on class-level alignment of the available data. Thus, it results in some data samples not being well-matched, thereby affecting the quality of data fusion. In this paper, we propose the Consistency-Aware Padding for Incomplete Multimodal Alignment Clustering Based on Self-Repellent Greedy Anchor Search(CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multimodal datasets. Specifically, we propose a self-repellent greedy anchor search module(SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multimodal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multimodal data fusion. Experimental results demonstrate the superiority of our method over benchmark datasets. The code will be publicly released at https://github.com/Autism-mm/CAPIMAC.git.
Summary: The paper introduces CAPIMAC, which uses a self-repellent greedy anchor search and consistency-aware padding to handle incomplete and misaligned multimodal data, improving fusion quality and outperforming baselines on benchmark datasets.

Authors:Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng, Feiran Wu, Changliang Xu
Title: Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models
Abstract:
Slide animations, such as fade-in, fly-in, and wipe, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage, temporal order, and detail fidelity. On a manually created test set of slides, the LoRA model increases BLEU-4 by around 60%, ROUGE-L by 30%, and shows significant improvements in CODA-detail. This demonstrates that low-rank adaptation enables reliable temporal reasoning and generalization beyond synthetic data. Overall, our dataset, LoRA-enhanced model, and CODA metric provide a rigorous benchmark and foundation for future research on VLM-based dynamic slide generation.
Summary: This work releases the first public dataset for slide-animation modeling and shows that fine-tuning Qwen-2.5-VL-7B with LoRA outperforms leading models in temporal reasoning and animation generation, establishing a new benchmark for dynamic slide creation.

Authors:Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero
Title: FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed
Abstract:
Large-scale vision foundation models such as DINOv2 achieve impressive performance by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, for new modalities, or simply for scientific questioning, which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence and, as a by-product, strengthens robustness to common corruptions. Our approach involves a frequency-filtering curriculum (low frequencies are seen first) and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, our method reduces pre-training time and FLOPs by 1.6x and 2.25x while still matching the baseline's robustness on corruption benchmarks (ImageNet-C) and maintaining competitive linear probing performance. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration of data curricula and augmentation as means to improve the robustness of self-supervised learning models. The code is available at https://github.com/KevinZ0217/fast_dinov2
Summary: This paper introduces a pre-training strategy for DINOv2 that accelerates convergence and strengthens robustness through a frequency-filtering curriculum and Gaussian noise patching, cutting pre-training time and FLOPs substantially while matching the baseline's robustness on ImageNet-C and its linear probing performance.
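Both the curriculum and the augmentation are simple image-space operations. A sketch under stated assumptions (the radial frequency mask and the linear schedule are our guesses at a reasonable instantiation, not the paper's exact recipe):

```python
import torch

def low_pass(images, keep_frac):
    """Keep only the lowest `keep_frac` fraction of spatial frequencies."""
    b, c, h, w = images.shape
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=images.device),
        torch.linspace(-1, 1, w, device=images.device), indexing="ij")
    mask = ((yy**2 + xx**2).sqrt() <= keep_frac).to(images.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

def gaussian_noise_patch(images, patch=32, sigma=0.5):
    """Overwrite one random square patch per image with Gaussian noise."""
    out = images.clone()
    b, c, h, w = out.shape
    for i in range(b):
        y = torch.randint(0, h - patch + 1, (1,)).item()
        x = torch.randint(0, w - patch + 1, (1,)).item()
        out[i, :, y:y+patch, x:x+patch] = sigma * torch.randn(c, patch, patch)
    return out

def keep_frac_at(step, total_steps, start=0.25):
    # Curriculum: widen the frequency band linearly as training progresses.
    return min(1.0, start + (1.0 - start) * step / total_steps)
```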

Authors:Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao
Title: StreamDiT: Real-Time Streaming Text-to-Video Generation
Abstract:
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching with a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To put the proposed method into practice, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, generating video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website: https://cumulo-autumn.github.io/StreamDiT/

Authors:Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, Feng Wu
Title: Flow-Anchored Consistency Models
Abstract:
Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: https://github.com/ali-vilab/FACM.
Summary: The Flow-Anchored Consistency Model (FACM) resolves training instability in continuous-time Consistency Models by using a Flow Matching objective as an anchor for the consistency shortcut objective, achieving state-of-the-art few-step generation without architectural changes.
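In spirit, the recipe adds a standard Flow Matching regression as an auxiliary anchor next to the consistency (shortcut) objective. The sketch below is deliberately simplified: the consistency term is a crude self-consistency stand-in rather than the paper's continuous-time objective, and the model signature and loss weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def fa_cm_losses(model, x0, lam=1.0, dt=0.01):
    # `model(xt, t)` is assumed to predict the velocity of a linear noise path
    # for image batches x0 of shape (B, C, H, W).
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise           # point on the interpolation path
    v_target = noise - x0                   # exact velocity of that path

    # Flow Matching anchor: regress the instantaneous velocity field.
    fm_loss = F.mse_loss(model(xt, t), v_target)

    # Crude consistency-style term: predictions at two nearby times on the
    # same path should agree (stop-gradient on the earlier time, as in CM).
    with torch.no_grad():
        ref = model(xt - dt * v_target, (t - dt).clamp_min(0.0))
    cm_loss = F.mse_loss(model(xt, t), ref)

    return cm_loss + lam * fm_loss          # anchor weighting is a guess
```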

Authors:Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, Hao Wang
Title: Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
Abstract:
3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: https://3dagentworld.github.io/S3PO-GS/.

Authors:Zhiling Yan, Sifan Song, Dingjie Song, Yiwei Li, Rong Zhou, Weixiang Sun, Zhennong Chen, Sekeun Kim, Hui Ren, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun
Title: SAMed-2: Selective Memory Enhanced Medical Segment Anything Model
Abstract:
Recent "segment anything" efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: https://github.com/ZhilingYan/Medical-SAM-Bench.
Summary: SAMed-2 is a medical image segmentation foundation model built on SAM-2 that addresses data complexity and noise through a temporal adapter and a confidence-driven memory, demonstrating superior performance across multiple tasks and datasets.
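The confidence-driven memory reduces to a filtered feature store with similarity-based retrieval. A minimal sketch; the capacity, threshold, eviction rule, and cosine retrieval are illustrative assumptions, not the paper's exact mechanism:

```python
import torch

class ConfidenceMemory:
    """Fixed-size store of high-certainty feature vectors."""
    def __init__(self, capacity=1024, dim=256):
        self.keys = torch.empty(0, dim)
        self.conf = torch.empty(0)
        self.capacity = capacity

    def write(self, feats, conf, threshold=0.9):
        keep = conf >= threshold                 # only store confident features
        self.keys = torch.cat([self.keys, feats[keep]])
        self.conf = torch.cat([self.conf, conf[keep]])
        if len(self.conf) > self.capacity:       # evict lowest-confidence entries
            top = self.conf.topk(self.capacity).indices
            self.keys, self.conf = self.keys[top], self.conf[top]

    def read(self, query, k=8):
        # Retrieve the k stored features most similar to each query vector.
        sim = torch.nn.functional.cosine_similarity(
            query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)   # (B, N)
        idx = sim.topk(min(k, self.keys.shape[0]), dim=1).indices
        return self.keys[idx]                    # (B, k, dim)
```

Retrieved features would then condition the decoder on past high-certainty context, which is how such a store can counter noisy annotations and forgetting.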

Authors:Ankit Sonthalia, Arnas Uselis, Seong Joon Oh
Title: On the rankability of visual embeddings
Abstract:
We study whether visual embedding models capture continuous, ordinal attributes along linear directions, which we term _rank axes_. We define a model as _rankable_ for an attribute if projecting embeddings onto such an axis preserves the attribute's order. Across 7 popular encoders and 9 datasets with attributes like age, crowd count, head pose, aesthetics, and recency, we find that many embeddings are inherently rankable. Surprisingly, a small number of samples, or even just two extreme examples, often suffice to recover meaningful rank axes, without full-scale supervision. These findings open up new use cases for image ranking in vector databases and motivate further study into the structure and learning of rankable embeddings. Our code is available at https://github.com/aktsonthalia/rankable-vision-embeddings.
Summary: This work shows that visual embedding models inherently capture ordinal attributes along linear rank axes, with as few as two extreme examples often sufficing to recover meaningful rankings across diverse datasets and encoders, opening new use cases for image ranking in vector databases.
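The core measurement is easy to reproduce: build a rank axis from a few extreme examples and check how well projections preserve the attribute order. A self-contained toy sketch on synthetic data; using Spearman correlation as the rankability score is our assumption of a natural choice:

```python
import numpy as np
from scipy.stats import spearmanr

def rank_axis_from_extremes(emb_low, emb_high):
    """Rank axis: the direction from the low-attribute mean to the high-attribute mean."""
    axis = emb_high.mean(axis=0) - emb_low.mean(axis=0)
    return axis / np.linalg.norm(axis)

def rankability(embeddings, attribute_values, axis):
    """Spearman correlation between projections and the ground-truth order."""
    return spearmanr(embeddings @ axis, attribute_values).correlation

# Toy usage: a synthetic embedding space where one direction encodes "age".
rng = np.random.default_rng(0)
ages = rng.uniform(0, 80, size=200)
E = rng.normal(size=(200, 64))
E[:, 0] += ages / 20.0                          # hide the attribute in one axis
axis = rank_axis_from_extremes(E[ages < 10], E[ages > 70])
print(f"Spearman rho: {rankability(E, ages, axis):.2f}")
```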

Authors:Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing, Alina Kuznetsova, Ira Ktena, Jennifer J. Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, Andrew Zisserman
Title: SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications
Abstract:
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.
Summary: Video foundation models show potential as general-purpose tools for scientific applications; the SciVid benchmark demonstrates that simple trainable readouts can transfer them to state-of-the-art results in several domains while also revealing their current limitations.
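The adaptation recipe, a frozen ViFM backbone plus a small trainable readout, looks roughly like the sketch below; the backbone's output shape and the mean-pooled linear readout are assumptions, not SciVid's exact modules:

```python
import torch
import torch.nn as nn

class ReadoutProbe(nn.Module):
    """Frozen video backbone with a small trainable readout head."""
    def __init__(self, backbone, feat_dim, num_outputs):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False             # only the readout is trained
        self.readout = nn.Sequential(
            nn.LayerNorm(feat_dim), nn.Linear(feat_dim, num_outputs))

    def forward(self, video):                   # video: (B, T, C, H, W)
        with torch.no_grad():
            feats = self.backbone(video)        # assumed to return (B, T', D)
        return self.readout(feats.mean(dim=1))  # pool over time, then read out
```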

Authors:Zetian Feng, Juan Fu, Xuebin Zou, Hongsheng Ye, Hong Wu, Jianhua Zhou, Yi Wang
Title: Hybrid-View Attention Network for Clinically Significant Prostate Cancer Classification in Transrectal Ultrasound
Abstract:
Prostate cancer (PCa) is a leading cause of cancer-related mortality in men, and accurate identification of clinically significant PCa (csPCa) is critical for timely intervention. Transrectal ultrasound (TRUS) is widely used for prostate biopsy; however, its low contrast and anisotropic spatial resolution pose diagnostic challenges. To address these limitations, we propose a novel hybrid-view attention (HVA) network for csPCa classification in 3D TRUS that leverages complementary information from transverse and sagittal views. Our approach integrates a CNN-transformer hybrid architecture, where convolutional layers extract fine-grained local features and transformer-based HVA models global dependencies. Specifically, the HVA comprises intra-view attention to refine features within a single view and cross-view attention to incorporate complementary information across views. Furthermore, a hybrid-view adaptive fusion module dynamically aggregates features along both channel and spatial dimensions, enhancing the overall representation. Experiments are conducted on an in-house dataset containing 590 subjects who underwent prostate biopsy. Comparative and ablation results prove the efficacy of our method. The code is available at https://github.com/mock1ngbrd/HVAN.
Summary: This study introduces a hybrid-view attention network for classifying clinically significant prostate cancer in 3D TRUS, combining convolutional and transformer components to exploit complementary information from the transverse and sagittal views.
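The cross-view half of the hybrid-view attention can be pictured as standard cross-attention between token sequences from the two views. A hedged sketch; the dimensions and the residual fusion are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Tokens from one view attend to tokens from the other view."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_view, context_view):
        # query_view, context_view: (B, N_tokens, dim) sequences from two views.
        fused, _ = self.attn(query_view, context_view, context_view)
        return self.norm(query_view + fused)    # residual fusion

# Usage: enrich transverse features with complementary sagittal information.
xattn = CrossViewAttention()
transverse, sagittal = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
out = xattn(transverse, sagittal)
```

Applying the same module with the roles swapped, plus per-view self-attention, gives the intra-view and cross-view pair the abstract describes.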

Authors:Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu
Title: Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering
Abstract:
Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at https://github.com/LeoQLi/LGSF.
Summary: This paper presents a method for estimating normals of noisy point clouds via local gradient-aware surface filtering, projecting points onto the underlying surface with implicit functions constrained by local gradients to counter over-smoothing and gradient degradation.

Authors:Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
Title: Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
Abstract:
In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID.
Summary: The Masked Temporal Interpolation Diffusion (MTID) model enhances procedure planning in instructional videos by generating intermediate latent features and incorporating task-specific supervision to produce temporally coherent action sequences.

Authors:Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
Title: Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices
Abstract:
Remote sensing change detection aims to localize semantic changes between images of the same location captured at different times. In the past few years, newer methods have attributed enhanced performance to the additions of new and complex components to existing architectures. Most fail to measure the performance contribution of fundamental design choices such as backbone selection, pre-training strategies, and training configurations. We claim that such fundamental design choices often improve performance even more significantly than the addition of new architectural components. Due to that, we systematically revisit the design space of change detection models and analyse the full potential of a well-optimised baseline. We identify a set of fundamental design choices that benefit both new and existing architectures. Leveraging this insight, we demonstrate that when carefully designed, even an architecturally simple model can match or surpass state-of-the-art performance on six challenging change detection datasets. Our best practices generalise beyond our architecture and also offer performance improvements when applied to related methods, indicating that the space of fundamental design choices has been underexplored. Our guidelines and architecture provide a strong foundation for future methods, emphasizing that optimizing core components is just as important as architectural novelty in advancing change detection performance. Code: https://github.com/blaz-r/BTC-change-detection
Summary: This study demonstrates that optimizing fundamental design choices such as backbone selection and training strategies can improve change detection performance more than complex architectural additions, as validated across six datasets.

Authors:Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Title: Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling
Abstract:
To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks. The code is available at https://github.com/SumomoTaku/DiffGuideSamp.
Summary: This paper introduces a task-specific sampling strategy for generative dataset distillation that matches the original dataset's difficulty distribution, improving the classification performance of the distilled data.
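The sampling strategy amounts to histogram matching in log-difficulty space. A small NumPy sketch under stated assumptions; the bin count and log1p pre-processing are our choices, not the paper's exact implementation:

```python
import numpy as np

def difficulty_matched_sampling(pool_difficulty, orig_difficulty, n_samples,
                                bins=20, rng=None):
    """Sample from a synthetic image pool so that its difficulty histogram
    matches the original dataset's (after a bias-correcting log transform)."""
    rng = rng or np.random.default_rng(0)
    pool_d = np.log1p(pool_difficulty)       # log transform as pre-processing
    orig_d = np.log1p(orig_difficulty)
    edges = np.histogram_bin_edges(orig_d, bins=bins)
    target, _ = np.histogram(orig_d, bins=edges, density=True)
    which = np.clip(np.digitize(pool_d, edges) - 1, 0, bins - 1)
    weights = target[which]
    weights = weights / weights.sum()        # sampling distribution over pool
    return rng.choice(len(pool_d), size=n_samples, replace=False, p=weights)
```

Any per-sample difficulty proxy (e.g., a classifier's loss on each image) can feed this routine; the distilled set is then drawn according to the returned indices.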

Authors:Peilin Tao, Hainan Cui, Diantao Tu, Shuhan Shen
Title: MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion
Abstract:
Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at https://github.com/3dv-casia/MGSfM/.
Summary: We introduce a robust global motion averaging framework for multi-camera systems, combining hierarchical rotation estimation with hybrid translation optimization to match or exceed incremental SfM accuracy at markedly higher efficiency.

Authors:Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han
Title: Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model
Abstract:
In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at https://github.com/wooseok-shin/SemiOVS
Summary: SemiOVS is a semi-supervised semantic segmentation framework that uses an open-vocabulary segmentation model to pseudo-label out-of-distribution unlabeled images, achieving state-of-the-art performance on benchmark datasets.

Authors:Deepan Adak, Yogesh Singh Rawat, Shruti Vyas
Title: MolVision: Molecular Property Prediction with Vision Language Models
Abstract:
Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adapting the vision encoder to molecular images in conjunction with LoRA further improves performance. The code and data are available at https://molvision.github.io/MolVision/.

Authors:Xinyang Li, Gen Li, Zhihui Lin, Yichen Qian, GongXin Yao, Weinan Jia, Aowen Wang, Weihua Chen, Fan Wang
Title: MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
Abstract:
Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAE), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications. Project Page: https://lixinyyang.github.io/MoDA.github.io/
Summary: MoDA addresses the inefficiency and limited expressiveness of diffusion-based talking head generation with a joint parameter space, flow matching, and a multi-modal diffusion architecture, improving realism, diversity, and efficiency.

Authors:Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Yezhou Yang
Title: Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations
Abstract:
Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors and may neglect the possibility that hallucination behaviours mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy categorizing VLMs' hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified behaviour: authority bias. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation. The benchmark is available at https://github.com/lxrswdd/AIpsych.
Summary: This study proposes that hallucinations in VLMs may stem from psychological biases rather than purely technical limitations, introducing a taxonomy and the AIpsych benchmark to analyze behaviours such as sycophancy and authority bias; larger models show stronger sycophancy but weaker authority bias.

Authors:Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring
Title: LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection
Abstract:
The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. Moreover, it demonstrates strong performance in cross-generator and cross-datasets settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available on the following link: https://github.com/AnaMVasilcoiu/LATTE-Diffusion-Detector.
Summary: LATTE detects AI-generated images by modeling the trajectory of latent embeddings across multiple denoising steps rather than single-step reconstruction errors, achieving strong performance in cross-generator and cross-dataset settings.
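Structurally, the detector consumes a short sequence of latents, one per denoising timestep, instead of a single reconstruction error. A minimal sketch in which a GRU aggregator stands in for the paper's refinement-and-fusion modules; module sizes are assumptions:

```python
import torch
import torch.nn as nn

class TrajectoryDetector(nn.Module):
    """Classify real vs. generated from a trajectory of denoising latents."""
    def __init__(self, latent_dim=512, hidden=256):
        super().__init__()
        self.step_proj = nn.Linear(latent_dim, hidden)
        self.agg = nn.GRU(hidden, hidden, batch_first=True)  # order-aware pooling
        self.head = nn.Linear(hidden, 2)

    def forward(self, latents):             # latents: (B, num_steps, latent_dim)
        h = self.step_proj(latents)
        _, last = self.agg(h)               # summarize the whole trajectory
        return self.head(last.squeeze(0))   # logits: real vs. generated

model = TrajectoryDetector()
logits = model(torch.randn(4, 5, 512))     # 4 images, 5 denoising steps each
```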

Authors:Haiqing Li, Yuzhi Guo, Feng Jiang, Thao M. Dang, Hehuan Ma, Qifeng Zhou, Jean Gao, Junzhou Huang
Title: Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis
Abstract:
Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLM) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet
Summary: TG-MILNet detects early-stage scoliosis non-invasively from gait videos with textual guidance, addressing temporal misalignment and class imbalance and improving sensitivity to borderline cases to reach state-of-the-art performance.

Authors:Huihui Xu, Yuanpeng Nie, Hualiang Wang, Ying Chen, Wei Li, Junzhi Ning, Lihao Liu, Hongqiu Wang, Lei Zhu, Jiyao Liu, Xiaomeng Li, Junjun He
Title: MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization
Abstract:
Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine spatial accuracy reward and semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose to use the Chain-of-Box template, which integrates visual information of referring bounding boxes into the reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets MS-CXR, ChestX-ray8, and M3D-RefSeg demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at https://github.com/bio-mlhui/MedGround-R1
Summary: This paper adapts GRPO reinforcement learning to Medical Image Grounding, removing the need for costly Chain-of-Thought annotations via Spatial-Semantic Rewards and a Chain-of-Box template, and achieves state-of-the-art results across multiple datasets.
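The reward design can be pictured as a weighted mix of box overlap and text-embedding agreement, with GRPO's group-relative baseline on top. A sketch in which the weighting and the z-scored advantage are assumptions rather than the paper's exact formulation:

```python
import torch

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) tuples of floats.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-8)

def spatial_semantic_reward(pred_box, gt_box, pred_emb, gt_emb, alpha=0.5):
    """Spatial accuracy (IoU) blended with semantic consistency (cosine
    similarity of text embeddings); alpha is an illustrative weight."""
    spatial = iou(pred_box, gt_box)
    semantic = torch.nn.functional.cosine_similarity(
        pred_emb, gt_emb, dim=-1).item()
    return alpha * spatial + (1 - alpha) * semantic

def group_relative_advantages(rewards):
    """GRPO's group baseline: advantages are z-scored within a sampled group."""
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-8)
```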

Authors:Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho, Zhanqian Wu, Lijun Zhou, Long Chen, Chitian Sun, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Kaicheng Yu
Title: DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction
Abstract:
Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLM) by synthesizing high-risk motion data. Specifically, we introduce a Bird's-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with the accident recognition accuracy soaring from 27.13% to 88.03%. Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting the accuracy from the base model's 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.
Summary: This work introduces DriveMRP-10K, a synthesized high-risk motion dataset, and the VLM-agnostic DriveMRP-Agent framework, which together raise accident recognition accuracy from 27.13% to 88.03% and generalize strongly to real-world scenarios.

Authors:Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu
Title: Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting
Abstract:
Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation (termed SDKD), which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details and low-frequency trends using convolution and a Transformer (a global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher's latent space, including both high- and low-frequency components, to guide the lightweight student model in capturing both local fine-grained variations and global evolution patterns. Experimental results show that SDKD significantly improves performance, achieving reductions of up to 81.3% in MSE and 52.3% in MAE on the Navier-Stokes equation dataset. The framework effectively captures both high-frequency variations and long-term trends while reducing computational complexity. Our codes are available at https://github.com/itsnotacie/SDKD
Summary: Spectral Decoupled Knowledge Distillation (SDKD) transfers multi-scale spatiotemporal representations from a complex teacher to an efficient student via frequency-aligned distillation, markedly improving forecasting accuracy while reducing computational cost.
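A frequency-aligned distillation loss can be sketched with an FFT band split of teacher and student latents. The radial cutoff and equal band weighting below are assumptions, not the paper's exact spectral features:

```python
import torch
import torch.nn.functional as F

def frequency_aligned_kd_loss(student_latent, teacher_latent, cutoff=0.25):
    """Match teacher and student latents separately in low- and high-frequency
    bands via a 2D FFT split. Latents: (B, C, H, W)."""
    def split_bands(z):
        spec = torch.fft.fftshift(torch.fft.fft2(z), dim=(-2, -1))
        h, w = z.shape[-2:]
        yy, xx = torch.meshgrid(
            torch.linspace(-1, 1, h, device=z.device),
            torch.linspace(-1, 1, w, device=z.device), indexing="ij")
        low = (yy**2 + xx**2).sqrt() <= cutoff
        return spec * low, spec * ~low

    s_low, s_high = split_bands(student_latent)
    t_low, t_high = split_bands(teacher_latent)
    # Compare complex spectra through their real/imaginary parts.
    return (F.mse_loss(torch.view_as_real(s_low), torch.view_as_real(t_low)) +
            F.mse_loss(torch.view_as_real(s_high), torch.view_as_real(t_high)))
```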

Authors:Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
Title: Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions
Abstract:
Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven, and other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low-bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.
Summary: This paper comprehensively surveys talking head generation methods, datasets, evaluation metrics, and loss functions, and outlines open challenges and future directions for researchers and practitioners in the field.

Authors:John Gideon, Kimimasa Tamura, Emily Sumner, Laporsha Dees, Patricio Reyes Gomez, Bassamul Haq, Todd Rowell, Avinash Balachandran, Simon Stent, Guy Rosman
Title: A Simulator Dataset to Support the Study of Impaired Driving
Abstract:
Despite recent advances in automated driving technology, impaired driving continues to incur a high cost to society. In this paper, we present a driving dataset designed to support the study of two common forms of driver impairment: alcohol intoxication and cognitive distraction. Our dataset spans 23.7 hours of simulated urban driving, with 52 human subjects under normal and impaired conditions, and includes both vehicle data (ground truth perception, vehicle pose, controls) and driver-facing data (gaze, audio, surveys). It supports analysis of changes in driver behavior due to alcohol intoxication (0.10% blood alcohol content), two forms of cognitive distraction (audio n-back and sentence parsing tasks), and combinations thereof, as well as responses to a set of eight controlled road hazards, such as vehicle cut-ins. The dataset will be made available at https://toyotaresearchinstitute.github.io/IDD/.

Authors:Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Title: Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
Abstract:
Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: https://github.com/YkiWu/Point3R.
Summary: Point3R is an online framework for dense streaming 3D reconstruction that maintains an explicit spatial pointer memory tied to the scene's 3D structure, integrating each new observation into a global coordinate system with competitive performance at low training cost.

Authors:Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Title: Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
Abstract:
Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3× compared to the original baselines while maintaining high visual fidelity, with up to a 36% PSNR improvement over the previous SOTA method. This makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.
Summary: EasyCache is a training-free acceleration framework that dynamically reuses transformation vectors in video diffusion models, delivering up to 2.1-3.3× faster inference and up to a 36% PSNR improvement over the prior state of the art.
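The caching idea, reusing the last step's transformation when the input has barely moved, fits in a small wrapper. A sketch with an assumed relative-change criterion; EasyCache's actual reuse rule may differ:

```python
import torch

class AdaptiveStepCache:
    """Skip a denoising step's heavy forward pass when its input barely moved,
    reusing the cached transformation (output minus input) instead."""
    def __init__(self, model, rel_threshold=0.05):
        self.model = model
        self.rel_threshold = rel_threshold
        self.prev_in = None
        self.prev_delta = None       # cached transformation vector

    @torch.no_grad()
    def step(self, x, t):
        if self.prev_in is not None and self.prev_delta is not None:
            change = (x - self.prev_in).norm() / (self.prev_in.norm() + 1e-8)
            if change < self.rel_threshold:
                self.prev_in = x
                return x + self.prev_delta    # cheap reuse path
        out = self.model(x, t)                # full forward pass
        self.prev_delta = out - x
        self.prev_in = x
        return out
```

Because the criterion is computed at runtime from the inputs themselves, no offline profiling or per-model tuning is needed, which is the property the abstract emphasizes.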

Authors:Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
Title: Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Abstract:
With the emergence of strong vision language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: vision-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct vision-focused strategies, dynamically generating auxiliary images when necessary to construct a vision-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. Code: https://github.com/Dtc7w3PQ/Visco-Attack.
Summary: Multimodal large language models face vision-centric jailbreak risks: the VisCo attack fabricates visual contextual dialogue and dynamically generated images to reliably trigger harmful responses, achieving high toxicity scores and attack success rates against models such as GPT-4o.

Authors:Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
Title: LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
Abstract:
Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
Summary: LangScene-X is a generative framework that reconstructs 3D language-embedded scenes from sparse views by jointly generating appearance, geometry, and semantics with a TriMap video diffusion model and a Language Quantized Compressor, enabling open-vocabulary queries without per-scene retraining.

Authors:Gent Serifi, Marcel C. Bühler
Title: HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
Abstract:
We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug in HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.
Summary: HyperGaussians extend 3D Gaussian Splatting to high-dimensional multivariate Gaussians conditioned on learnable local embeddings, improving expressivity for animatable face avatars and outperforming 3DGS on fine details and complex facial motion.
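Why parameterize the inverse covariance? For a jointly Gaussian [spatial | latent] variable, conditioning on the latent part leaves the spatial precision block untouched, so no high-dimensional inversion is needed at splat time. A sketch of that textbook identity; the 3+k block layout is our illustrative assumption about how such a conditioning step could look:

```python
import torch

def conditional_gaussian(mu, prec, z):
    """Condition a joint Gaussian over [spatial(3) | latent(k)] dims on a
    latent value z, working directly with the precision (inverse covariance).
    Shapes: mu (3+k,), prec (3+k, 3+k), z (k,)."""
    P_ss = prec[:3, :3]                    # spatial-spatial precision block
    P_sz = prec[:3, 3:]                    # spatial-latent coupling block
    # Standard identity: conditional precision is P_ss (no big inversion),
    # and the mean shifts linearly with the conditioning value.
    mu_cond = mu[:3] - torch.linalg.solve(P_ss, P_sz @ (z - mu[3:]))
    return mu_cond, P_ss                   # a 3D Gaussian ready for splatting
```

Only a 3x3 solve remains per Gaussian, which is why storing the precision rather than the covariance can make high-dimensional Gaussians cheap to splat.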

Authors:Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang
Title: Partial Weakly-Supervised Oriented Object Detection
Abstract:
The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably trade annotation speed against annotation cost. To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which efficiently leverages large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with the same partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.
Summary: This paper introduces a Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework that utilizes partial weak annotations to reduce annotation costs while achieving performance comparable to or better than traditional semi-supervised methods.

Authors:Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen
Title: Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics
Abstract:
Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.
Summary: The Multipole Attention Neural Operator (MANO) introduces a distance-based multiscale attention mechanism inspired by n-body simulations, achieving linear complexity in time and memory while maintaining competitive performance with models like ViT and Swin Transformer across tasks such as image classification.
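
The multipole idea can be approximated by splitting attention into a near field, computed exactly within local neighborhoods, and a far field, computed against coarsened summaries of distant points. The toy sketch below shows that two-scale decomposition on a 1D grid; it is our simplification, not MANO's actual multi-level operator:

```python
import torch
import torch.nn.functional as F

def near_far_attention(q, k, v, window=16, pool=16):
    """Toy near/far-field attention on a 1D grid (hypothetical simplification).

    q, k, v: (B, N, D) with N divisible by `window` and `pool`.
    Near field: exact attention inside non-overlapping local windows.
    Far field: attention over mean-pooled (coarsened) keys/values.
    Both terms are linear in N up to constant factors.
    """
    B, N, D = q.shape
    # near field: exact attention within each local window
    qw = q.view(B, N // window, window, D)
    kw = k.view(B, N // window, window, D)
    vw = v.view(B, N // window, window, D)
    near = F.scaled_dot_product_attention(qw, kw, vw).reshape(B, N, D)
    # far field: coarsen keys/values by average pooling, attend globally
    kc = k.view(B, N // pool, pool, D).mean(dim=2)
    vc = v.view(B, N // pool, pool, D).mean(dim=2)
    far = F.scaled_dot_product_attention(q, kc, vc)
    return near + far  # crude combination of the two scales
```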

Authors:Mélanie Gaillochet, Mehrdad Noori, Sahar Dastani, Christian Desrosiers, Hervé Lombaert
Title: Prompt learning with bounding box constraints for medical image segmentation
Abstract:
Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at https://github.com/Minimel/box-prompt-learning-VFM.git
Summary: This paper introduces a weakly supervised segmentation framework that automates prompt generation for vision foundation models using only bounding box annotations, achieving superior performance over existing methods with an average Dice score of 84.90% in a limited data setting.
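
Constraints derived from a box typically take two forms: the predicted mask should be empty outside the box, and it should touch every row (and, analogously, column) that crosses a tight box. A hedged sketch of such a loss follows; this is our illustrative formulation, not necessarily the paper's exact constraint terms:

```python
import torch

def box_constraint_loss(prob, box_mask):
    """Weak-supervision loss derived from a bounding box (illustrative).

    prob:     (B, 1, H, W) predicted foreground probabilities
    box_mask: (B, 1, H, W) binary mask, 1 inside the annotated box
    """
    # 1) emptiness constraint: no foreground predicted outside the box
    outside = (prob * (1 - box_mask)).sum() / ((1 - box_mask).sum() + 1e-6)
    # 2) tightness-style constraint: every row crossing the box should
    #    contain some foreground (columns can be handled the same way)
    row_max = (prob * box_mask).amax(dim=-1)   # (B, 1, H) best prob per row
    row_hit = box_mask.amax(dim=-1)            # rows that intersect the box
    coverage = ((1 - row_max) * row_hit).sum() / (row_hit.sum() + 1e-6)
    return outside + coverage
```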

Authors:Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Shao-Lun Huang, Yu Li
Title: CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation
Abstract:
Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, and lip-sync. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.

Authors:JungWoo Chae, Jiyoon Kim, JaeWoong Choi, Kyungyul Kim, Sangheum Hwang
Title: APT: Adaptive Personalized Training for Diffusion Models with Limited Data
Abstract:
Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model's internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
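
Of APT's three components, Representation Stabilization is the most directly expressible in code: penalize drift in the statistics of intermediate feature maps between the fine-tuned and pretrained networks. A minimal sketch under assumed (B, C, H, W) features; the function name and exact statistic choice are ours:

```python
import torch

def stat_alignment_loss(feat_ft, feat_pre):
    """Penalize shifts in channel-wise mean/variance between fine-tuned and
    pretrained intermediate features (a minimal reading of Representation
    Stabilization; not the paper's exact formulation).

    feat_ft, feat_pre: (B, C, H, W) feature maps from the same layer
    """
    m_ft, v_ft = feat_ft.mean(dim=(0, 2, 3)), feat_ft.var(dim=(0, 2, 3))
    m_pr, v_pr = feat_pre.mean(dim=(0, 2, 3)), feat_pre.var(dim=(0, 2, 3))
    return (m_ft - m_pr).pow(2).mean() + (v_ft - v_pr).pow(2).mean()
```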

Authors:Qingyu Fan, Yinghao Cai, Chao Li, Chunting Jiao, Xudong Zheng, Tao Lu, Bin Liang, Shuo Wang
Title: MISCGrasp: Leveraging Multiple Integrated Scales and Contrastive Learning for Enhanced Volumetric Grasping
Abstract:
Robotic grasping faces challenges in adapting to objects with varying shapes and sizes. In this paper, we introduce MISCGrasp, a volumetric grasping method that integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. We propose a query-based interaction between high-level and low-level features through the Insight Transformer, while the Empower Transformer selectively attends to the highest-level features, which synergistically strikes a balance between focusing on fine geometric details and overall geometric structures. Furthermore, MISCGrasp utilizes multi-scale contrastive learning to exploit similarities among positive grasp samples, ensuring consistency across multi-scale features. Extensive experiments in both simulated and real-world environments demonstrate that MISCGrasp outperforms baseline and variant methods in tabletop decluttering tasks. More details are available at https://miscgrasp.github.io/.

Authors:Abiam Remache González, Meriem Chagour, Timon Bijan Rüth, Raúl Trapiella Cañedo, Marina Martínez Soler, Álvaro Lorenzo Felipe, Hyun-Suk Shin, María-Jesús Zamorano Serrano, Ricardo Torres, Juan-Antonio Castillo Parra, Eduardo Reyes Abad, Miguel-Ángel Ferrer Ballester, Juan-Manuel Afonso López, Francisco-Mario Hernández Tejera, Adrian Penate-Sanchez
Title: IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning
Abstract:
This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by point of view and determine rostrum integrity. A "two-factor authentication (human and AI)" system is proposed, which reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from ViTPose to predict 23 key points on the shrimp's skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture practices. The code is available at https://github.com/AbiamRemacheGonzalez/ImaShrimp-public
Summary: This paper presents IMASHRIMP, an automated system using adapted deep learning and computer vision to analyze white shrimp morphology from RGBD images, significantly reducing human error and achieving high precision in pose estimation and measurement conversion for enhanced genetic selection in aquaculture.
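
The pixel-to-centimeter regression step is a small supervised model. A self-contained sketch with scikit-learn, using made-up calibration numbers purely for illustration:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical calibration data: pixel lengths vs. caliper measurements in cm.
pixels = np.array([[812.0], [640.0], [955.0], [705.0], [880.0]])
cm = np.array([12.4, 9.8, 14.6, 10.7, 13.5])

# Fit the SVM regressor (hyper-parameters here are illustrative).
svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(pixels, cm)
print(svr.predict([[900.0]]))  # pixel measurement -> centimeters
```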

Authors:Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala
Title: Automatic Labelling for Low-Light Pedestrian Detection
Abstract:
Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. One challenge in RGB pedestrian detection for which large public datasets appear to be lacking is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used, 2) a label transfer process from the infrared detections to their RGB counterparts, and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and on ground-truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at https://github.com/BouzoulasDimitrios/IR-RGB-Automated-LowLight-Pedestrian-Labeling
Summary: This research introduces an automated infrared-RGB labeling pipeline that addresses low-light pedestrian detection by transferring infrared detection labels to RGB images, with models trained on the generated labels outperforming those trained on ground-truth labels in most test cases.
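
Because KAIST provides pixel-aligned IR/RGB pairs, the transfer step reduces to running the IR detector and writing its boxes out as labels for the RGB twin. A sketch of that step, where detect_ir stands in for the fine-tuned infrared detector (its interface is our assumption):

```python
from pathlib import Path

def transfer_labels(ir_dir, rgb_label_dir, detect_ir):
    """Sketch of the IR->RGB label-transfer step, assuming pixel-aligned
    IR/RGB pairs (as in KAIST) so IR boxes are valid for the RGB frame.

    detect_ir: callable returning [(cls, x_c, y_c, w, h), ...] in normalized
    coordinates for one IR image; a stand-in for the fine-tuned IR detector.
    """
    for ir_path in sorted(Path(ir_dir).glob("*.png")):
        rows = [f"{c} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"
                for c, x, y, w, h in detect_ir(ir_path)]
        # one YOLO-format .txt per frame; the RGB twin shares the filename
        out = Path(rgb_label_dir) / (ir_path.stem + ".txt")
        out.write_text("\n".join(rows))
```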

Authors:Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi
Title: Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy
Abstract:
Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.
Summary: This study introduces a supervised contrastive learning method with temporally-aware soft targets and a temporal adjacency constraint for polyp counting in colonoscopy, improving clustering robustness and achieving a 2.2x reduction in fragmentation rate compared to prior approaches.
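
One way to realize temporally-aware soft targets is to weight same-polyp pairs by temporal proximity and normalize them into a target distribution for a contrastive cross-entropy. The sketch below is our minimal reading of the idea; tau and sigma are illustrative hyper-parameters, not the paper's values:

```python
import torch
import torch.nn.functional as F

def temporal_supcon_loss(z, pid, t, tau=0.1, sigma=5.0):
    """Supervised contrastive loss with temporally-aware soft targets.

    z:   (N, D) L2-normalized tracklet embeddings
    pid: (N,) polyp identity labels
    t:   (N,) tracklet timestamps (e.g. center frame index)
    """
    t = t.float()
    logits = z @ z.t() / tau
    logits.fill_diagonal_(float("-inf"))           # exclude self-pairs
    same = (pid[:, None] == pid[None, :]).float()
    # same-polyp pairs weighted by temporal proximity -> soft targets
    w = same * torch.exp(-(t[:, None] - t[None, :]).pow(2) / (2 * sigma**2))
    w.fill_diagonal_(0)
    target = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```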

Authors:Zunhui Xia, Hongxing Li, Libin Lan
Title: MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention
Abstract:
Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is designed to explicitly attend to the most relevant content. Theoretical analysis demonstrates that MedFormer outperforms existing medical vision transformers in terms of generality and efficiency. Extensive experiments across various imaging modality datasets show that MedFormer consistently enhances performance in all three medical image recognition tasks mentioned above. MedFormer provides an efficient and versatile solution for medical image recognition, with strong potential for clinical application.
Summary: MedFormer is an efficient and versatile medical vision transformer that uses a pyramid scaling structure and a novel Dual Sparse Selection Attention (DSSA) to overcome limitations of existing methods, demonstrating superior performance across various medical image recognition tasks.
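
A content-aware dual selection can be sketched as two rounds of top-k: first score pooled regions against the queries and keep the best regions, then keep the best individual keys inside them. The following is our illustrative reading of the idea, not the paper's exact DSSA:

```python
import torch

def dual_sparse_select(q, k, region_size=16, top_regions=4, top_tokens=32):
    """Two-stage content-aware sparse key selection (illustrative).

    q, k: (B, N, D) queries and keys, with N divisible by region_size.
    Stage 1 keeps the top-scoring regions; stage 2 keeps the top individual
    keys inside those regions. Returns indices of the kept keys.
    """
    B, N, D = k.shape
    R = N // region_size
    k_regions = k.view(B, R, region_size, D).mean(dim=2)      # region descriptors
    region_score = torch.einsum("bnd,brd->br", q, k_regions)  # pooled affinity
    top_r = region_score.topk(top_regions, dim=1).indices     # (B, top_regions)
    # expand region indices to the token indices they cover
    offsets = torch.arange(region_size, device=k.device)
    token_idx = (top_r[..., None] * region_size + offsets).reshape(B, -1)
    k_sel = torch.gather(k, 1, token_idx[..., None].expand(-1, -1, D))
    token_score = torch.einsum("bnd,bmd->bm", q, k_sel)
    keep = token_score.topk(top_tokens, dim=1).indices
    return torch.gather(token_idx, 1, keep)                   # indices into N keys
```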

Authors:Teng Fu, Yuwen Chen, Zhuofan Chen, Mengyang Zhao, Bin Li, Xiangyang Xue
Title: CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios
Abstract:
Multi-object tracking is a classic field in computer vision. Among its subfields, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For motion information, mutual occlusions between objects often prevent updating of the motion state; for appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from annotated data is the simplest solution, existing MOT datasets fail to support it. Existing datasets mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some video sequences in existing datasets do not have the above-mentioned drawbacks, their number is far from adequate for research purposes. To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and entirely in real-life complex scenarios. We name it "CrowdTrack" because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively, tested multiple SOTA models on it, and also analyzed the performance of foundation models on it. The dataset and project code are released at: https://github.com/loseevaya/CrowdTrack .
Summary: The authors introduce "CrowdTrack," a challenging large-scale dataset for multi-pedestrian tracking captured from first-person views in complex real-world scenarios, designed to overcome limitations of existing datasets by providing 33 videos with 5,185 fully annotated trajectories to advance robust tracking algorithms.

Authors:Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu
Title: F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning
Abstract:
Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F$^{2}$TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F$^{2}$TTA. Code is available at https://github.com/mar-cry/F2TTA.
Summary: To handle clinical data that arrives in domain fragments of arbitrary length and random order rather than in complete domain units, this paper proposes the Image-level Disentangled Prompt Tuning (I-DiPT) framework, which combines Uncertainty-oriented Masking and Parallel Graph Distillation to adapt a source model under unpredictable inter-fragment shifts.

Authors:Mufhumudzi Muthivhi, Terence L. van Zyl
Title: Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings
Abstract:
Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervised ones across all downstream tasks. The code is available at https://github.com/pxpana/SSLWildlife.
Summary: This study explores self-supervised learning for wildlife re-identification by automatically extracting temporal image pairs from camera trap data, demonstrating that self-supervised models outperform supervised approaches in robustness and downstream-task performance even with limited data.

Authors:Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi
Title: Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection
Abstract:
We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
Summary: This method detects deepfake videos by analyzing pixel-wise temporal inconsistencies through a per-pixel 1D Fourier transform along the time axis, integrating these features with spatio-temporal context via a joint transformer module to significantly improve detection robustness.
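
The core feature extraction is compact: transform each pixel's intensity trace along the time axis and keep the magnitude spectrum. A minimal sketch (the input shape is our assumption):

```python
import torch

def temporal_frequency_features(video):
    """Per-pixel 1D Fourier transform along time, the paper's core idea.

    video: (B, T, H, W) grayscale clip in [0, 1]
    returns (B, T//2 + 1, H, W) magnitude spectra, one spectrum per pixel
    """
    spec = torch.fft.rfft(video, dim=1)  # 1D FFT on the time axis only
    return spec.abs()
```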

Authors:Jiahao Wu, Rui Peng, Jianbo Jiao, Jiayu Yang, Luyang Tang, Kaiqiang Xiong, Jie Liang, Jinbo Yan, Runling Liu, Ronggang Wang
Title: LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling
Abstract:
Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: https://wujh2001.github.io/LocalDyGS/.

Authors:Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
Title: Hita: Holistic Tokenizer for Autoregressive Image Generation
Abstract:
Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) arranging a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
Summary: The Hita tokenizer introduces a holistic-to-local scheme with sequential token arrangement and a fusion module to enhance autoregressive image generation by capturing global properties, achieving superior performance on benchmarks like ImageNet.

Authors:Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Title: Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback
Abstract:
Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).
Summary: InnerControl enhances text-to-image diffusion models by enforcing spatial consistency across all denoising steps through lightweight convolutional probes that reconstruct control signals from intermediate features, achieving superior alignment and generation quality when combined with existing methods like ControlNet++.
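
The mechanism is a small regression head plus an alignment loss averaged over every denoising step. A hedged sketch under assumed feature shapes; ControlProbe and its layer sizes are ours, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ControlProbe(nn.Module):
    """Lightweight conv probe regressing the control signal (e.g. a depth
    map) from intermediate UNet features."""
    def __init__(self, in_ch, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, feat, size):
        # upsample the prediction to the control signal's resolution
        return nn.functional.interpolate(self.net(feat), size=size,
                                         mode="bilinear")

def alignment_loss(probe, feats_per_step, control):
    """Average the discrepancy over *all* denoising steps, not just the
    final ones (the key departure from cycle-consistency-only training)."""
    losses = [nn.functional.mse_loss(probe(f, control.shape[-2:]), control)
              for f in feats_per_step]
    return torch.stack(losses).mean()
```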

Authors:Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo
Title: Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos
Abstract:
Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object \& Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at https://jasoncodemaker.github.io/SynTVA/.

Authors:JaeHyuck Choi, MinJun Kim, JeHyeong Hong
Title: MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation
Abstract:
Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC (Mask-guided inpainting with multi-level perturbations and Context-aware alignment) to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. Under an identical evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.
Summary: MAGIC is a few-shot anomaly generation method that preserves normal backgrounds, aligns anomalies precisely with masks, and ensures semantic plausibility through multi-level perturbations and context-aware alignment, outperforming existing techniques on the MVTec-AD dataset.
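
Both perturbation strategies are one-liners in spirit: jitter the prompt embedding globally, and add extra noise only inside the anomaly mask. A sketch with illustrative noise scales (the sigma values are ours, not the paper's):

```python
import torch

def perturb_prompt(prompt_emb, sigma=0.05):
    """Gaussian prompt-level perturbation: jitter the text embedding to
    broaden the global appearance of generated anomalies."""
    return prompt_emb + sigma * torch.randn_like(prompt_emb)

def mask_guided_noise(latent, mask, sigma=0.1):
    """Mask-guided spatial noise injection: extra noise only inside the
    anomaly mask to enrich local texture variation."""
    return latent + sigma * torch.randn_like(latent) * mask
```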

Authors:Dohoon Kim, Donghun Kang, Taesup Moon
Title: DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning
Abstract:
Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.
Summary: DoMIX introduces an efficient and parallel domain-adaptive pre-training method using LoRA modules to overcome the high computational cost, domain-order sensitivity, and lack of task-specific models in continual DAP, and extends to standard LLM fine-tuning scenarios.
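
The parameter-efficient backbone of such a design is one LoRA pair per domain attached to a frozen linear layer, trained independently and mixed at use time. The class and mixing rule below are our illustration of the general mechanism, not DoMIX's exact design:

```python
import torch
import torch.nn as nn

class ParallelLoRALinear(nn.Module):
    """Frozen base linear layer plus one LoRA module per domain; domains can
    be trained independently (in parallel) and combined at use time."""
    def __init__(self, base: nn.Linear, n_domains: int, r: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)   # shared pretrained weights
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(r, d_in) * 0.01) for _ in range(n_domains)])
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, r)) for _ in range(n_domains)])

    def forward(self, x, weights):
        # weights: per-domain mixing coefficients, e.g. one-hot for a single
        # domain or soft weights to blend accumulated domain knowledge
        out = self.base(x)
        for w, A, B in zip(weights, self.A, self.B):
            out = out + w * (x @ A.t() @ B.t())
        return out
```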

Authors:Fanghai Yi, Zehong Zheng, Zexiao Liang, Yihang Dong, Xiyang Fang, Wangyu Wu, Xuhang Chen
Title: MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement
Abstract:
Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at https://github.com/onlycatdoraemon/MAC-Lookup.
Summary: The MAC-Lookup model effectively enhances underwater images by combining conditional 3D lookup table color correction with multi-axis adaptive enhancement, improving visual quality and outperforming existing methods in experiments.
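
The workhorse behind CLTCC is the classic trilinearly interpolated 3D lookup table, which maps each input RGB value to a corrected one; the conditional part (predicting the LUT from image content) is omitted in this sketch:

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(img, lut):
    """Trilinear 3D LUT color transform (a standard building block).

    img: (B, 3, H, W) RGB in [0, 1]
    lut: (3, S, S, S) output RGB, indexed by (r, g, b) bins along (D, H, W)
    """
    B, _, H, W = img.shape
    # grid_sample expects coords in [-1, 1], ordered (x, y, z) = (W, H, D),
    # i.e. (b, g, r) for the LUT layout above
    grid = img.permute(0, 2, 3, 1) * 2 - 1            # (B, H, W, 3) as (r, g, b)
    grid = grid[..., [2, 1, 0]].view(B, 1, H, W, 3)   # reorder to (b, g, r)
    lut = lut.unsqueeze(0).expand(B, -1, -1, -1, -1)  # (B, 3, S, S, S)
    out = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)
    return out.view(B, 3, H, W)
```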

Authors:Yuxiang Zhang, Wei Li, Wen Jia, Mengmeng Zhang, Ran Tao, Shunlin Liang
Title: Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation
Abstract:
Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of source and target domains, and a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in the coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noise conditions. Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA outperforms the most advanced method by 3%-5%. The code will be available at: https://github.com/YuxiangZhang-BIT/IEEE_TCSVT_BiDA.
Summary: This paper introduces a Bi-directional Domain Adaptation (BiDA) framework that uses a triple-branch transformer to enhance cross-domain hyperspectral image classification by extracting domain-invariant and domain-specific features, achieving superior performance over existing methods.

Authors:Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi
Title: SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
Abstract:
Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

Authors:Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang
Title: ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning
Abstract:
Event stream based scene text recognition is a newly arising research topic which performs better than widely used RGB cameras in extremely challenging scenarios, especially under low illumination and fast motion. Existing works either adopt an end-to-end encoder-decoder framework or large language models for enhanced recognition; however, they are still limited by insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-Former is used to align the vision tokens to the pre-trained large language model Vicuna-7B and output both the answer and the chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three-stage process (i.e., generation, polishing, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/ESTR-CoT.
Summary: This paper introduces ESTR-CoT, a novel event stream scene text recognition framework that integrates chain-of-thought reasoning to enhance interpretability and contextual logic, validated by extensive experiments on benchmark datasets.

Authors:Yukang Cao, Chenyang Si, Jinghao Wang, Ziwei Liu
Title: FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
Abstract:
We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with different semantics or layouts. Unlike existing methods that rely on finetuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without requiring per-instance training. Despite their efficiency and potential, tuning-free methods face challenges in maintaining high-quality results due to the non-linear nature of the multi-step denoising process and biases inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address these challenges by integrating two key innovations. 1) We first propose a guidance-aware spherical interpolation design that incorporates explicit guidance from the input images by modifying the self-attention modules, thereby addressing identity loss and ensuring directional transitions throughout the generated sequence. 2) We further introduce a step-oriented variation trend that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both inputs. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods, being 10x ~ 50x faster and establishing a new state-of-the-art for image morphing.
Summary: FreeMorph is the first tuning-free image morphing method that handles inputs with different semantics or layouts, achieving high-fidelity results 10x to 50x faster than existing methods without per-instance training.
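
The starting point of guidance-aware spherical interpolation is plain slerp between the two input latents; FreeMorph then adds attention-level guidance on top, which this minimal sketch omits:

```python
import torch

def slerp(z0, z1, alpha):
    """Spherical interpolation between two latents at blend factor alpha."""
    z0f, z1f = z0.flatten(), z1.flatten()
    cos = torch.dot(z0f, z1f) / (z0f.norm() * z1f.norm())
    omega = torch.acos(torch.clamp(cos, -1 + 1e-7, 1 - 1e-7))
    s = torch.sin(omega)
    return (torch.sin((1 - alpha) * omega) / s) * z0 \
         + (torch.sin(alpha * omega) / s) * z1
```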

Authors:Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang
Title: Kwai Keye-VL Technical Report
Abstract:
While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
Summary: Kwai Keye-VL is an 8-billion-parameter multimodal foundation model designed to excel at understanding dynamic short videos through a four-stage pre-training and two-phase post-training recipe, achieving state-of-the-art results on video benchmarks while remaining competitive on general vision-language tasks.

Authors:Nan Chen, Mengqi Huang, Yihao Meng, Zhendong Mao
Title: LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
Abstract:
Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.

Authors:Yiming Ju, Jijin Hu, Zhengxiong Luo, Haoge Deng, Hanyu Zhao, Li Du, Chengwei Wu, Donglin Hao, Xinlong Wang, Tengfei Pan
Title: CI-VID: A Coherent Interleaved Text-Video Dataset
Abstract:
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset. We release the CI-VID dataset and the accompanying code for data construction and evaluation at: https://github.com/ymju-BAAI/CI-VID
Summary: The CI-VID dataset advances text-and-video-to-video generation by providing over 340,000 coherent multi-clip sequences with detailed captions, enabling models to produce story-driven videos with improved accuracy and temporal consistency.

Authors:Zhentan Zheng
Title: evMLP: An Efficient Event-Driven MLP Architecture for Vision
Abstract:
Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as "events". Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and trained models are available at https://github.com/i-evi/evMLP.
Summary: The paper introduces evMLP, an event-driven multi-layer perceptron architecture that selectively processes image patches where changes ("events") occur between frames, achieving competitive accuracy on ImageNet while reducing computational cost in video processing.
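
The event-driven local update can be sketched in a few lines: patchify the current and previous frames, flag patches whose content changed, and run the MLP only on those, reusing cached outputs elsewhere. Shapes and the change threshold below are our assumptions:

```python
import torch

def event_driven_update(frame, prev_frame, prev_out, mlp, patch=16, thresh=1e-3):
    """Recompute only patches where 'events' occur; reuse cached outputs.

    frame, prev_frame: (C, H, W) consecutive frames
    prev_out: (N_patches, D) cached MLP outputs from the previous frame
    mlp: callable mapping flattened patches (M, C*patch*patch) -> (M, D)
    """
    C, H, W = frame.shape
    to_patches = lambda x: (x.unfold(1, patch, patch).unfold(2, patch, patch)
                              .permute(1, 2, 0, 3, 4)
                              .reshape(-1, C * patch * patch))
    cur, prev = to_patches(frame), to_patches(prev_frame)
    events = (cur - prev).abs().mean(dim=1) > thresh  # per-patch change mask
    out = prev_out.clone()
    if events.any():
        out[events] = mlp(cur[events])                # update only event patches
    return out, events
```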

Authors:Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan
Title: IC-Custom: Diverse Image Customization via In-Context Learning
Abstract:
Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to effectively handle diverse tasks and distinguish between inputs in polyptych configurations. To address the data gap, we curated a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom
Summary: IC-Custom introduces a unified framework that integrates position-aware and position-free image customization through in-context learning, achieving superior performance with minimal parameter training while supporting diverse industrial applications.

Authors:Kunlun Xu, Fan Zhuo, Jiangmeng Li, Xu Zou, Jiahuan Zhou
Title: Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification
Abstract:
Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem, where LReID methods suffer severe performance degradation. Existing LReID methods, even when combined with semi-supervised strategies, achieve limited long-term adaptation because they struggle with the noisy knowledge that arises when exploiting unlabeled data. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions and generate high-quality pseudo-labels. Then, the dual-knowledge cooperation scheme integrates current model specialization and historical model generalization, refining noisy pseudo-labels. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation over long-term learning. Experiments on the established Semi-LReID benchmarks show that our SPRED achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/ICCV2025-SPRED
Summary: This paper introduces SPRED, a framework for Semi-Supervised Lifelong Person Re-Identification that establishes a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and dual-knowledge purification, achieving state-of-the-art performance on benchmarks.

Authors:Hailong Yan, Ao Li, Xiangtao Zhang, Zhe Liu, Zenglin Shi, Ce Zhu, Le Zhang
Title: MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices
Abstract:
Recent advancements in deep neural networks have driven significant progress in image enhancement (IE). However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. Our approach integrates reparameterization with an Incremental Weight Optimization strategy to ensure efficiency. Additionally, we enhance performance with a Feature Self-Transform module and a Hierarchical Dual-Path Attention mechanism, optimized with a Local Variance-Weighted loss. With this efficient framework, we are the first to achieve real-time IE inference at up to 1,100 frames per second (FPS) while delivering competitive image quality, achieving the best trade-off between speed and performance across multiple IE tasks. The code will be available at https://github.com/AVC2-UESTC/MobileIE.git.
Summary: This paper introduces an extremely lightweight CNN framework with only around 4K parameters, achieving real-time image enhancement at up to 1,100 FPS on mobile devices while maintaining competitive image quality through innovative optimization techniques.
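The abstract's reparameterization idea can be illustrated with the standard conv-BN folding used by most reparameterized ConvNets; MobileIE's exact branch topology is not described here, so this is a generic sketch rather than the paper's module.

```python
import torch

def fuse_conv_bn(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    """Fold a BatchNorm2d into the preceding Conv2d for inference.

    Training uses conv + BN; at deployment the pair collapses into one
    convolution with identical outputs, removing the BN cost entirely.
    """
    std = (bn.running_var + bn.eps).sqrt()
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels,
                            conv.kernel_size, conv.stride,
                            conv.padding, bias=True)
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.weight.data = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    fused.bias.data = bn.weight * (bias - bn.running_mean) / std + bn.bias
    return fused
```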

Authors:Tyler Ward, Meredith K. Owen, O'Kira Coleman, Brian Noehren, Abdullah-Al-Zubaer Imran
Title: Autoadaptive Medical Segment Anything Model
Abstract:
Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: https://github.com/tbwa233/ADA-SAM.
Summary: ADA-SAM is an automated medical image segmentation framework that integrates class activation maps and a gradient feedback mechanism to enhance accuracy in limited-label settings, significantly outperforming existing methods.

Authors:Shengli Zhou, Jianuo Zhu, Qilin Huang, Fangjing Wang, Yanfu Zhang, Feng Zheng
Title: HCNQA: Enhancing 3D VQA with Hierarchical Concentration Narrowing Supervision
Abstract:
3D Visual Question-Answering (3D VQA) is pivotal for models to perceive the physical world and perform spatial reasoning. Answer-centric supervision is a commonly used training method for 3D VQA models. Many models that utilize this strategy have achieved promising results in 3D VQA tasks. However, the answer-centric approach only supervises the final output of models and allows models to develop reasoning pathways freely. The absence of supervision on the reasoning pathway enables the potential for developing superficial shortcuts through common patterns in question-answer pairs. Moreover, although slow-thinking methods advance large language models, they suffer from underthinking. To address these issues, we propose \textbf{HCNQA}, a 3D VQA model leveraging a hierarchical concentration narrowing supervision method. By mimicking the human process of gradually focusing from a broad area to specific objects while searching for answers, our method guides the model to perform three phases of concentration narrowing through hierarchical supervision. By supervising key checkpoints on a general reasoning pathway, our method can ensure the development of a rational and effective reasoning pathway. Extensive experimental results demonstrate that our method can effectively ensure that the model develops a rational reasoning pathway and performs better. The code is available at https://github.com/JianuoZhu/HCNQA.
Summary: The proposed HCNQA model introduces hierarchical concentration narrowing supervision to guide 3D VQA models through a structured reasoning pathway, effectively preventing superficial shortcuts and enhancing performance.

Authors:Tianze Hua, Tian Yun, Ellie Pavlick
Title: How Do Vision-Language Models Process Conflicting Information Across Modalities?
Abstract:
AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
Summary: This study investigates how vision-language models handle conflicting inputs between image and text modalities, revealing that models often favor one modality over the other, with specific attention heads influencing this preference and enabling control over modality prioritization.

Authors:Peng Zheng, Junke Wang, Yi Chang, Yizhou Yu, Rui Ma, Zuxuan Wu
Title: Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis
Abstract:
Recent advances in large language models (LLMs) have spurred interest in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored autoregressively predicting continuous tokens. Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces DisCon (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of 1.38 on ImageNet 256$\times$256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin. Project page: https://pengzheng0707.github.io/DisCon.

Authors:Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang
Title: DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
Abstract:
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non-referent and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability. The code and models are available at https://github.com/Dmmm1997/DeRIS.
Summary: The DeRIS framework addresses the primary bottleneck in Referring Image Segmentation by identifying insufficient multimodal cognitive capacity as the key limitation and introducing a Loopback Synergy mechanism to enhance perception-cognition integration.

Authors:Kai Chen, Ruiyuan Gao, Lanqing Hong, Hang Xu, Xu Jia, Holger Caesar, Dengxin Dai, Bingbing Liu, Dzmitry Tsishkou, Songcen Xu, Chunjing Xu, Qiang Xu, Huchuan Lu, Dit-Yan Yeung
Title: ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving
Abstract:
In this paper, we present details of the 1st W-CODA workshop, held in conjunction with ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. Five speakers from academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, covering both corner case scene understanding and generation. As a pioneering effort, we aim to continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents that are robust to corner cases.
Summary: The 1st W-CODA workshop at ECCV 2024 focuses on advancing autonomous driving solutions for corner cases through multimodal AI, featuring expert talks and dual-track challenges on scene understanding and generation.

Authors:Martine Hjelkrem-Tan, Marius Aasan, Gabriel Y. Arteaga, Adín Ramírez Rivera
Title: SPoT: Subpixel Placement of Tokens in Vision Transformers
Abstract:
Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
Summary: The proposed Subpixel Placement of Tokens (SPoT) method enables continuous token positioning within images to overcome grid limitations, significantly reducing token requirements while improving performance through strategic sparsity.
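Continuous token placement reduces, at its core, to sampling features at fractional coordinates. A minimal sketch follows; the feature map and sampling grid are illustrative, and SPoT's oracle-guided search for the positions themselves is not shown.

```python
import torch
import torch.nn.functional as F

def sample_tokens(feature_map: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample tokens at continuous positions.

    feature_map: (B, C, H, W); positions: (B, N, 2) with (x, y) in [-1, 1].
    Returns (B, N, C) tokens, free of any discrete patch-grid alignment.
    """
    grid = positions.unsqueeze(2)                       # (B, N, 1, 2)
    out = F.grid_sample(feature_map, grid, mode='bilinear', align_corners=False)
    return out.squeeze(-1).transpose(1, 2)              # (B, C, N) -> (B, N, C)
```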

Authors:Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou
Title: Depth Anything at Any Condition
Abstract:
We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability to generate high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks. Project Page: https://ghost233lism.github.io/depthanything-AC-page Code: https://github.com/HVision-NKU/DepthAnythingAC
Summary: DepthAnything-AC is a foundation monocular depth estimation model that handles diverse environmental conditions through unsupervised consistency regularization and a Spatial Distance Constraint, demonstrating strong zero-shot performance across various benchmarks.
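The consistency-regularization finetuning needs only unlabeled images. A minimal sketch of the core loss is given below, assuming the clean-view prediction serves as the pseudo-label; the corruption pipeline and the Spatial Distance Constraint term are omitted.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, clean_img, corrupted_img):
    """Pull depth predicted on a corrupted view toward the clean-view depth."""
    with torch.no_grad():
        target = model(clean_img)        # pseudo-label: prediction on clean input
    pred = model(corrupted_img)          # prediction under, e.g., fog or low light
    return F.l1_loss(pred, target)
```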

Authors:Camille Billouard, Dawa Derksen, Alexandre Constantin, Bruno Vallet
Title: Tile and Slide: A New Framework for Scaling NeRF from Local to Global 3D Earth Observation
Abstract:
Neural Radiance Fields (NeRF) have recently emerged as a paradigm for 3D reconstruction from multiview satellite imagery. However, state-of-the-art NeRF methods are typically constrained to small scenes due to the memory footprint during training, which we study in this paper. Previous work on large-scale NeRFs palliates this by dividing the scene into multiple NeRFs. This paper introduces Snake-NeRF, a framework that scales to large scenes. Our out-of-core method eliminates the need to load all images and networks simultaneously, and operates on a single device. We achieve this by dividing the region of interest into NeRFs that tile in 3D without overlap. Importantly, we crop the images with overlap to ensure each NeRF is trained with all the necessary pixels. We introduce a novel $2\times 2$ 3D tile progression strategy and segmented sampler, which together prevent 3D reconstruction errors along the tile edges. Our experiments conclude that large satellite images can effectively be processed with linear time complexity, on a single GPU, and without compromise in quality.
Summary: Snake-NeRF is a scalable framework that enables large-scale 3D reconstruction from satellite imagery by dividing scenes into non-overlapping 3D tiles and processing them efficiently on a single GPU without quality loss.

Authors:Yuxiao Wang, Yu Lei, Zhenao Wei, Weiying Xue, Xinyu Jiang, Nan Zhuang, Qi Liu
Title: Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
Abstract:
The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to over-segmentation in areas with little interaction and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed \textbf{P3HOT}, is proposed, which blends \textbf{P}rompt guidance and human \textbf{P}roximal \textbf{P}erception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network's attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive a key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) has been created as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called ``AD-Acc.'' is introduced to address the shortcomings of existing methods in handling negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves an improvement of \textbf{0.7}$\uparrow$, \textbf{2.0}$\uparrow$, \textbf{1.6}$\uparrow$, and \textbf{11.0}$\uparrow$ in SC-Acc., mIoU, wIoU, and AD-Acc. metrics, respectively, on the HOT-Annotated dataset. The source code is available at https://github.com/YuxiaoWang-AI/P3HOT.
Summary: The proposed P3HOT framework enhances human-object contact detection by integrating prompt guidance and human proximal perception to improve regional focus and interaction accuracy, achieving state-of-the-art results across multiple metrics on benchmark datasets.

Authors:Xu Zhang, Ming Lu, Yan Chen, Zhan Ma
Title: Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference
Abstract:
In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC.
Summary: This paper introduces Perception-Oriented Latent Coding (POLC) to enrich the semantic content of compressed-image latents, enabling superior vision task performance with minimal fine-tuning through a plug-and-play adapter.

Authors:Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang
Title: DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
Abstract:
Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
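One simple way to realize the scale synchronization that DepthSync targets is a least-squares scale-and-shift fit on the frames shared by two windows. The sketch below shows only that alignment step, as an assumption; the paper's diffusion-guidance machinery is not reproduced here.

```python
import numpy as np

def align_scale_shift(prev_overlap: np.ndarray, curr_overlap: np.ndarray):
    """Solve (s, t) minimizing ||s * curr + t - prev||^2 on overlapping frames.

    Applying (s, t) to the whole new window keeps its depth scale consistent
    with the previous window before decoding continues.
    """
    x, y = curr_overlap.ravel(), prev_overlap.ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t
```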

Authors:Youngjin Oh, Junhyeong Kwon, Keuntek Lee, Nam Ik Cho
Title: Towards Controllable Real Image Denoising with Camera Parameters
Abstract:
Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus on ISO, shutter speed, and F-number, which are closely related to noise levels. We convert these selected parameters into a vector to control and enhance the performance of the denoising network. Experimental results show that our method seamlessly adds controllability to standard denoising neural networks and improves their performance. Code is available at https://github.com/OBAKSA/CPADNet.
Summary: This paper presents a controllable image denoising framework that adaptively removes noise by incorporating camera parameters such as ISO, shutter speed, and F-number, enhancing both the flexibility and performance of standard denoising networks.
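Since ISO, shutter speed, and F-number each span orders of magnitude, a log-scaled control vector is a natural encoding. The sketch below is an assumption about how such a vector might be built; the reference constants are illustrative defaults, not the paper's values.

```python
import torch

def camera_condition(iso: float, shutter_s: float, f_number: float) -> torch.Tensor:
    """Encode (ISO, shutter speed, F-number) as a conditioning vector.

    Values are log-normalized against mid-range references; the resulting
    vector can be mapped by a small MLP to per-channel modulation parameters.
    """
    refs = torch.tensor([100.0, 1.0 / 60.0, 2.8])    # reference ISO/shutter/aperture
    raw = torch.tensor([iso, shutter_s, f_number])
    return torch.log(raw / refs)

cond = camera_condition(iso=3200, shutter_s=1 / 30, f_number=1.8)  # e.g. low light
```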

Authors:Bryan Constantine Sadihin, Michael Hua Wang, Shei Pern Chua, Hang Su
Title: SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation
Abstract:
The production of high-quality 2D animation is a highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch information via lightweight channel-concatenation adapters combined with LoRA finetuning, our method natively integrates conditioning without the parameter and memory bloat of a duplicated ControlNet, greatly reducing parameter count and GPU memory usage. Evaluated on the SAKUGA dataset, SketchColour outperforms previous state-of-the-art video colourization methods across all metrics, despite using only half the training data of competing models. Our approach produces temporally coherent animations with minimal artifacts such as colour bleeding or object deformation. Our code is available at: https://bconstantine.github.io/SketchColour.

Authors:Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao
Title: A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation
Abstract:
Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at https://github.com/KeyanHu-git/IDGBR.
Summary: The IDGBR framework combines discriminative and generative learning to enhance boundary precision in remote sensing semantic segmentation by refining coarse segmentation maps through a diffusion denoising process.

Authors:Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, Martin Schramm
Title: Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence
Abstract:
The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textually describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework's practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
Summary: The paper introduces cRID, a cross-modal framework that detects textually describable PII in street-level imagery and improves person re-identification through interpretable feature analysis.

Authors:Jimyeong Kim, Jungwon Park, Yeji Song, Nojun Kwak, Wonjong Rhee
Title: ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
Abstract:
Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.

Authors:Chentao Shen, Ding Pan, Mingyu Mei, Zaixing He, Xinyue Zhao
Title: Active Control Points-based 6DoF Pose Tracking for Industrial Metal Objects
Abstract:
Visual pose tracking has played an increasingly vital role in industrial contexts in recent years. However, pose tracking for industrial metal objects remains a challenging task, especially in real-world environments, due to the reflective characteristics of metal objects. To address this issue, we propose a novel 6DoF pose tracking method based on active control points. The method actively uses image control points, instead of 6DoF pose-based rendering, to generate edge features for optimization, and treats the control points themselves as optimization variables. We also introduce an optimal control point regression method to improve robustness. The proposed tracking method performs effectively in both dataset evaluation and real-world tasks, providing a viable solution for real-time tracking of industrial metal objects. Our source code is made publicly available at: https://github.com/tomatoma00/ACPTracking.
Summary: This paper introduces a novel 6DoF pose tracking method using active control points to overcome the challenges of tracking reflective industrial metal objects, demonstrating effectiveness in both dataset evaluations and real-world applications.

Authors:Jonáš Herec, Vít Růžička, Rado Pitoňák
Title: Optimizing Methane Detection On Board Satellites: Speed, Accuracy, and Low-Power Solutions for Resource-Constrained Hardware
Abstract:
Methane is a potent greenhouse gas, and detecting its leaks early via hyperspectral satellite imagery can help mitigate climate change. Meanwhile, many existing missions operate in manual tasking regimes only, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane enhancement methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. We test fast target detection methods (ACE, CEM) that have not previously been used for methane detection and propose Mag1c-SAS, a significantly faster variant of the current state-of-the-art algorithm for methane detection, Mag1c. To explore their true detection potential, we integrate them with a machine learning model (U-Net, LinkNet). Our results identify two promising candidates (Mag1c-SAS and CEM), both acceptably accurate for the detection of strong plumes and computationally efficient enough for onboard deployment: one optimized more for accuracy, the other more for speed, achieving up to ~100x and ~230x faster computation than the original Mag1c on resource-limited hardware. Additionally, we propose and evaluate three band selection strategies. One of them can outperform the method traditionally used in the field while using fewer channels, leading to even faster processing without compromising accuracy. This research lays the foundation for future advancements in onboard methane detection with minimal hardware requirements, improving timely data delivery. The produced code, data, and models are open-sourced and can be accessed from https://github.com/zaitra/methane-filters-benchmark.
Summary: This study introduces efficient, low-power algorithms for onboard methane leak detection from hyperspectral imagery, proposing Mag1c-SAS and CEM methods that achieve up to ~230x faster computation than current standards while maintaining accuracy for strong plume identification.
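CEM is a closed-form matched filter: with background correlation R and target signature t, the detector is w = R^{-1} t / (t^T R^{-1} t), which is why it is cheap enough for onboard hardware. A sketch follows, under the assumption that pixel spectra are stored as rows; the target spectrum itself is an input, not shown.

```python
import numpy as np

def cem_scores(pixels: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Constrained Energy Minimization (CEM) detection scores.

    pixels: (N, B) spectra; target: (B,) signature, e.g. a methane
    absorption spectrum. Scores are ~1 on pixels matching the target.
    """
    R = pixels.T @ pixels / pixels.shape[0]      # (B, B) background correlation
    Rinv_t = np.linalg.solve(R, target)          # avoids an explicit inverse
    w = Rinv_t / (target @ Rinv_t)               # CEM filter weights
    return pixels @ w                            # one dot product per pixel
```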

Authors:Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li
Title: Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
Abstract:
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.
Summary: The proposed REG method enhances diffusion models by entangling low-level image latents with a high-level class token, enabling efficient generation of coherent image-class pairs from noise with minimal computational overhead and significant improvements in training speed and quality.

Authors:Shaocheng Yan, Pengcheng Shi, Zhenjun Zhao, Kaixin Wang, Kuang Cao, Ji Wu, Jiayuan Li
Title: TurboReg: TurboClique for Robust and Efficient Point Cloud Registration
Abstract:
Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC$^2$ scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates $208.22\times$ faster than 3DMAC while also achieving higher recall. Our code is accessible at https://github.com/Laka-3DV/TurboReg.
Summary: TurboReg introduces a fast and robust point cloud registration estimator using lightweight TurboCliques and a parallel Pivot-Guided Search, achieving state-of-the-art performance with linear time complexity and significant speed gains.
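The efficiency argument rests on replacing maximal-clique enumeration with fixed-size 3-cliques seeded at high-scoring pivots. Below is a simplified sketch of that search; the compatibility test and the per-correspondence scores stand in for the paper's SC^2 machinery and are assumptions.

```python
import numpy as np

def pivot_guided_3cliques(compat: np.ndarray, scores: np.ndarray, k_pivots: int = 10):
    """Enumerate 3-cliques seeded at the top-scoring pivot correspondences.

    compat: (M, M) boolean compatibility matrix over correspondences;
    scores: (M,) per-correspondence scores. Each returned triple is a
    candidate from which a rigid transform can be estimated.
    """
    cliques = []
    for p in np.argsort(-scores)[:k_pivots]:          # best pivots first
        neigh = np.flatnonzero(compat[p])
        for a, i in enumerate(neigh):                 # pairs of pivot neighbors
            for j in neigh[a + 1:]:
                if compat[i, j]:                      # closes the triangle
                    cliques.append((p, i, j))
    return cliques
```

Unlike maximal-clique search, the cost is bounded by the pivot count times the squared neighborhood size, which is where the reported near-linear behavior comes from.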

Authors:Chen Sun, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Liejun Wang, Dan Ma, Gaobo Yang, Keqin Li
Title: DiffMark: Diffusion-based Robust Watermark Against Deepfakes
Abstract:
Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of a watermark with the image during generation. In this study, we propose a novel robust watermarking framework based on a diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate the corresponding watermarked image. In the construction of the facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity as the noise decreases, thus better adapting to the sampling process of the diffusion model. To achieve the fusion of the watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during the training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs a specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.
Summary: This study introduces DiffMark, a robust watermarking framework based on diffusion models that integrates facial images and watermarks through adaptive conditioning and cross-attention fusion to enhance resistance against Deepfake manipulations.

Authors:Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, Yoshitaka Ushiku
Title: CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
Abstract:
An image captioning model that flexibly switches its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition their language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption (length, descriptiveness, and word uniqueness) as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506\% despite better lexical alignment. Code will be available at https://github.com/omron-sinicx/captionsmiths.
Summary: The proposed CaptionSmiths model enables fine-grained control over caption properties like length and descriptiveness by quantifying them as continuous values and using vector interpolation, achieving smoother transitions and better lexical alignment than existing methods.
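The conditioning mechanism described in the abstract is plain linear interpolation between two learned endpoint embeddings per property. A minimal sketch, with hypothetical endpoint vectors:

```python
import torch

def property_embedding(alpha: float, v_min: torch.Tensor, v_max: torch.Tensor):
    """Interpolate between endpoint embeddings for one caption property.

    alpha in [0, 1] is the normalized target value (e.g. caption length);
    v_min / v_max are learned vectors for the two extreme states.
    """
    return (1.0 - alpha) * v_min + alpha * v_max

# One embedding per property (length, descriptiveness, word uniqueness) can be
# combined and fed to the captioner; sweeping alpha moves the output smoothly
# between the extremes instead of jumping between discrete modes.
```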

Authors:Huanwen Liang, Jingxian Xu, Yuanji Zhang, Yuhao Huang, Yuhan Zhang, Xin Yang, Ran Li, Xuedong Deng, Yanjun Liu, Guowei Tao, Yun Wu, Sheng Zhao, Xinru Gao, Dong Ni
Title: Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound
Abstract:
Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emphasis on case-level diagnosis. In this paper, we develop a case-level multiple instance learning (MIL)-based method, free of standard plane localization, for classifying fetal abdominal anomalies in prenatal ultrasound. Our contribution is three-fold. First, we adopt a mixture-of-attention-experts module (MoAE) to weight different attention heads for various planes. Second, we propose a medical-knowledge-driven feature selection module (MFS) to align image features with medical knowledge, performing self-supervised image token selection at the case level. Finally, we propose a prompt-based prototype learning (PPL) module to enhance the MFS. Extensively validated on a large prenatal abdominal ultrasound dataset containing 2,419 cases, with a total of 24,748 images and 6 categories, our proposed method outperforms state-of-the-art competitors. Code is available at: https://github.com/LL-AC/AAcls.
Summary: This study introduces a case-level multiple instance learning method for classifying fetal abdominal anomalies in prenatal ultrasound that eliminates the need for standard plane localization and integrates attention mechanisms, medical knowledge, and prototype learning to achieve superior performance on a large dataset.

Authors:Langyu Wang, Bingke Zhu, Yingying Chen, Yiyuan Zhang, Ming Tang, Jinqiao Wang
Title: MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
Abstract:
The weakly-supervised audio-visual video parsing (AVVP) task aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and deficiencies in model architecture, existing methods struggle to simultaneously improve both segment-level and event-level prediction. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) that emphasises the uniqueness of each segment and excludes noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. For feature processing and interaction, we employ an audio-visual Mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset in all metrics (e.g., gains of 2.1% and 1.2% in terms of visual segment-level and audio segment-level metrics). Our code is available at https://github.com/WangLY136/MUG.
Summary: This paper introduces MUG, an audio-visual Mamba network with pseudo-label augmentation that enhances segment-level event parsing by generating cross-modal training data and reducing noise interference, achieving state-of-the-art results on the LLP dataset.

Authors:Tianrui Lou, Xiaojun Jia, Siyuan Liang, Jiawei Liang, Ming Zhang, Yanjun Xiao, Xiaochun Cao
Title: 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation
Abstract:
Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attacks are a more promising approach than patch-based attacks, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. For these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA. Our code is available at: https://github.com/TRLou/PGA.
Summary: The proposed PGA framework leverages 3D Gaussian Splatting to rapidly reconstruct and render realistic scenes from few images, enhancing adversarial camouflage effectiveness and cross-view robustness through occlusion prevention and min-max optimization that adapts backgrounds across viewpoints.

Authors:Cuong Le, Huy-Phuong Le, Duc Le, Minh-Thien Duong, Van-Binh Nguyen, My-Ha Le
Title: Physics-informed Ground Reaction Dynamics from Human Motion Capture
Abstract:
Body dynamics are crucial information for the analysis of human motions in important research fields, ranging from biomechanics and sports science to computer vision and graphics. Modern approaches collect the body dynamics, specifically the external reactive forces, via force plates synchronized with human motion capture data, and train a black-box deep learning model to estimate the dynamics. Being specialized devices, force plates can only be installed in laboratory setups, imposing a significant limitation on the learning of human dynamics. To this end, we propose a novel method for estimating human ground reaction dynamics directly from the more reliable motion capture data, with physics laws and computational simulation as constraints. We introduce a highly accurate and robust method for computing ground reaction forces from motion capture data using Euler's integration scheme and a PD algorithm. The physics-based reactive forces are used to inform the learning model about the physics-informed motion dynamics, thus improving the estimation accuracy. The proposed approach was tested on the GroundLink dataset, outperforming the baseline model on: 1) ground reaction force estimation accuracy compared to force plate measurements; and 2) simulated root trajectory precision. The implementation code is available at https://github.com/cuongle1206/Phys-GRD.
Summary: This paper introduces a physics-based method that uses motion capture data with Euler integration and PD control to accurately estimate ground reaction forces, outperforming baseline models in both force estimation and trajectory precision.
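The physics side of the method follows from Newton's second law: during ground contact, F_grf = m * (a_com - g), with the acceleration obtained by finite differences of the captured trajectory. A sketch of that computation is below; per-segment masses and the PD tracking term are omitted as simplifying assumptions.

```python
import numpy as np

def ground_reaction_force(com_positions: np.ndarray, mass: float, dt: float):
    """Estimate total ground reaction force from center-of-mass motion.

    com_positions: (T, 3) trajectory from motion capture. Differentiating
    twice (an Euler-style scheme) gives the COM acceleration, and Newton's
    second law yields F_grf = m * (a - g) while the body touches the ground.
    """
    g = np.array([0.0, 0.0, -9.81])
    vel = np.gradient(com_positions, dt, axis=0)   # (T, 3) velocity
    acc = np.gradient(vel, dt, axis=0)             # (T, 3) acceleration
    return mass * (acc - g)                        # (T, 3); m*9.81 upward at rest
```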

Authors:Dong Liang, Xingyu Qiu, Yuzhen Li, Wei Wang, Kuanquan Wang, Suyu Dong, Gongning Luo
Title: Structure and Smoothness Constrained Dual Networks for MR Bias Field Correction
Abstract:
MR imaging techniques are of great benefit to disease diagnosis. However, due to the limitations of MR devices, significant intensity inhomogeneity often exists in imaging results, which impedes both qualitative and quantitative medical analysis. Recently, several unsupervised deep learning-based models have been proposed for MR image improvement. However, these models merely concentrate on global appearance learning and neglect constraints from image structures and the smoothness of the bias field, leading to distorted corrected results. In this paper, novel structure and smoothness constrained dual networks, named S2DNets, are proposed for self-supervised bias field correction. S2DNets introduce piece-wise structural constraints and smoothness of the bias field into network training to effectively remove non-uniform intensity and retain much more structural detail. Extensive experiments executed on both clinical and simulated MR datasets show that the proposed model outperforms other conventional and deep learning-based models. In addition to comparison on visual metrics, downstream MR image segmentation tasks are also used to evaluate the impact of the proposed model. The source code is available at https://github.com/LeongDong/S2DNets.
Summary: This paper introduces S2DNets, a self-supervised dual network that incorporates structural constraints and bias field smoothness to correct intensity inhomogeneity in MR images while preserving structural details, outperforming existing methods in experiments.
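The smoothness constraint on the bias field admits a very compact realization: penalize the spatial gradients of the predicted field. The first-order penalty below is one common choice, shown as an assumption rather than the paper's exact loss.

```python
import torch

def bias_smoothness_loss(bias_field: torch.Tensor) -> torch.Tensor:
    """Penalize spatial gradients so the predicted bias field varies slowly.

    bias_field: (B, 1, H, W). Since true MR bias fields are low-frequency,
    a total-variation style penalty keeps anatomy out of the field estimate.
    """
    dh = (bias_field[:, :, 1:, :] - bias_field[:, :, :-1, :]).abs().mean()
    dw = (bias_field[:, :, :, 1:] - bias_field[:, :, :, :-1]).abs().mean()
    return dh + dw
```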

Authors:Worameth Chinchuthakun, Pakkapon Phongthawee, Amit Raj, Varun Jampani, Pramook Khungurn, Supasorn Suwajanakorn
Title: DiffusionLight-Turbo: Accelerated Light Probes for Free via Single-Pass Chrome Ball Inpainting
Abstract:
We introduce a simple yet effective technique for estimating lighting from a single low-dynamic-range (LDR) image by reframing the task as a chrome ball inpainting problem. This approach leverages a pre-trained diffusion model, Stable Diffusion XL, to overcome the generalization failures of existing methods that rely on limited HDR panorama datasets. While conceptually simple, the task remains challenging because diffusion models often insert incorrect or inconsistent content and cannot readily generate chrome balls in HDR format. Our analysis reveals that the inpainting process is highly sensitive to the initial noise in the diffusion process, occasionally resulting in unrealistic outputs. To address this, we first introduce DiffusionLight, which uses iterative inpainting to compute a median chrome ball from multiple outputs to serve as a stable, low-frequency lighting prior that guides the generation of a high-quality final result. To generate high-dynamic-range (HDR) light probes, an Exposure LoRA is fine-tuned to create LDR images at multiple exposure values, which are then merged. While effective, DiffusionLight is time-intensive, requiring approximately 30 minutes per estimation. To reduce this overhead, we introduce DiffusionLight-Turbo, which reduces the runtime to about 30 seconds with minimal quality loss. This 60x speedup is achieved by training a Turbo LoRA to directly predict the averaged chrome balls from the iterative process. Inference is further streamlined into a single denoising pass using a LoRA swapping technique. Experimental results show that our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios. Our code is available at https://diffusionlight.github.io/turbo
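The HDR step merges the Exposure-LoRA outputs back into linear radiance. Below is a minimal weighted merge; the gamma value and the well-exposedness weighting are assumptions, not the paper's exact procedure.

```python
import numpy as np

def merge_exposures(ldr_images, evs, gamma=2.2):
    """Merge LDR chrome-ball renderings at several EVs into one HDR probe.

    ldr_images: list of (H, W, 3) arrays in [0, 1]; evs: exposure values.
    Each image is linearized and divided by its exposure; weights favor
    well-exposed pixels so clipped highlights come from darker exposures.
    """
    acc, wsum = 0.0, 0.0
    for img, ev in zip(ldr_images, evs):
        radiance = img ** gamma / (2.0 ** ev)      # back to linear radiance
        w = 1.0 - np.abs(img - 0.5) * 2.0          # trust mid-range pixels
        acc = acc + w * radiance
        wsum = wsum + w
    return acc / np.maximum(wsum, 1e-6)
```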

Authors:Ahmad Chaddad, Jihao Peng, Yihang Wu
Title: Classification based deep learning models for lung cancer and disease using medical images
Abstract:
The use of deep learning (DL) in medical image analysis has significantly improved the ability to predict lung cancer. In this study, we introduce a novel deep convolutional neural network (CNN) model, named ResNet+, which is based on the established ResNet framework. This model is specifically designed to improve the prediction of lung cancer and lung diseases from medical images. To address the challenge of missing feature information that occurs during the downsampling process in CNNs, we integrate the ResNet-D module, a variant designed to enhance feature extraction capabilities by modifying the downsampling layers, into the traditional ResNet model. Furthermore, a convolutional attention module was incorporated into the bottleneck layers to enhance model generalization by allowing the network to focus on relevant regions of the input images. We evaluated the proposed model using five public datasets, comprising lung cancer (LC2500 $n$=3183, IQ-OTH/NCCD $n$=1336, and LCC $n$=25000 images) and lung disease (ChestXray $n$=5856, and COVIDx-CT $n$=425024 images). To address class imbalance, we used data augmentation techniques to artificially increase the representation of underrepresented classes in the training dataset. The experimental results show that the ResNet+ model achieves remarkable accuracy/F1, reaching 98.14/98.14\% on the LC25000 dataset and 99.25/99.13\% on the IQ-OTH/NCCD dataset. Furthermore, the ResNet+ model reduces computational cost compared to the original ResNet series when predicting lung cancer images. The proposed model outperformed the baseline models on publicly available datasets, achieving better performance metrics. Our code is publicly available at https://github.com/AIPMLab/Graduation-2024/tree/main/Peng.
中文: 本研究提出ResNet+增强型深度卷积神经网络,通过整合ResNet-D模块和注意力机制来提升医学影像中肺癌及肺部疾病的预测能力,在多个公共数据集上实现了高精度和计算效率。
English: This study introduces ResNet+, an enhanced deep convolutional neural network that integrates a ResNet-D module and attention mechanism to improve lung cancer and disease prediction from medical images, achieving high accuracy and computational efficiency across multiple public datasets.
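
For readers unfamiliar with the ResNet-D module mentioned above: it changes the projection shortcut so that downsampling no longer discards feature-map positions. A minimal PyTorch sketch of that idea (channel sizes are illustrative):

    import torch.nn as nn

    def resnet_d_shortcut(in_ch, out_ch, stride=2):
        # ResNet-D replaces the strided 1x1 conv shortcut (which ignores
        # 3 of every 4 spatial positions) with average pooling followed
        # by a stride-1 1x1 conv, so every input position contributes.
        return nn.Sequential(
            nn.AvgPool2d(kernel_size=stride, stride=stride, ceil_mode=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    shortcut = resnet_d_shortcut(64, 128)  # e.g., at a stage transition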

Authors:Zhuo Su, Li Liu, Matthias Müller, Jiehua Zhang, Diana Wofk, Ming-Ming Cheng, Matti Pietikäinen
Title: Rapid Salient Object Detection with Difference Convolutional Neural Networks
Abstract:
This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Unlike those methods, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with < 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than 2x and 3x in speed with superior accuracy. Code will be available at https://github.com/hellozhuo/stdnet.git.
中文: 本文提出了一种结合传统对比度方法与现代卷积神经网络的高效显著目标检测模型,通过像素差分卷积在资源受限设备上实现了更优的速度与精度平衡。
English: This paper introduces efficient models for salient object detection that combine traditional contrast-based methods with modern CNNs using Pixel Difference Convolutions, achieving superior speed-accuracy trade-offs on resource-constrained devices.
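
The difference convolution reparameterization (DCR) idea above can be made concrete for the central-difference variant of PDC: the pixel differences fold algebraically into a single standard kernel, so inference costs nothing extra. A small PyTorch check (the paper covers several PDC variants; this shows only the central one):

    import torch
    import torch.nn.functional as F

    def fold_central_pdc(weight):
        # Central PDC: y = sum_i w_i * (x_i - x_center)
        #            = conv(x, W) - x_center * sum_i(w_i),
        # i.e., a standard conv whose center tap is reduced by sum(W).
        folded = weight.clone()
        kh, kw = weight.shape[-2:]
        folded[..., kh // 2, kw // 2] -= weight.sum(dim=(-2, -1))
        return folded

    x = torch.randn(1, 8, 32, 32)
    w = torch.randn(16, 8, 3, 3)
    explicit = F.conv2d(x, w, padding=1) - F.conv2d(x, w.sum(dim=(-2, -1), keepdim=True))
    merged = F.conv2d(x, fold_central_pdc(w), padding=1)
    assert torch.allclose(explicit, merged, atol=1e-4)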

Authors:Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, Tong He
Title: VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Abstract:
In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly, most notably achieving up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains. Project website: https://xiaoxiao0406.github.io/vqvla.github.io
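
As background for the tokenizer above, the core of any VQ-based action tokenizer is a nearest-neighbor lookup into a learned codebook with a straight-through gradient. A generic sketch (dimensions and codebook size are illustrative, not the paper's):

    import torch

    def vector_quantize(z, codebook):
        # z: (batch, dim) continuous action features; codebook: (K, dim).
        # Each feature is replaced by its nearest code vector; the
        # straight-through trick copies gradients past the argmin.
        dists = torch.cdist(z, codebook)   # (batch, K) pairwise distances
        ids = dists.argmin(dim=1)          # discrete action tokens
        z_q = codebook[ids]
        z_q = z + (z_q - z).detach()       # straight-through estimator
        return z_q, ids

    codebook = torch.randn(512, 64)
    z_q, tokens = vector_quantize(torch.randn(10, 64), codebook)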

Authors:Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, Wenhan Luo
Title: DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution
Abstract:
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
中文: 真实世界视频超分辨率面临时间一致性和细节生成的挑战,提出的DAM-VSR框架通过解耦外观增强与运动控制,利用扩散模型和双向采样策略实现了卓越性能。
English: Real-world video super-resolution faces challenges with temporal consistency and detail generation, which the proposed DAM-VSR framework addresses by disentangling appearance enhancement and motion control using diffusion models and a bidirectional sampling strategy for superior performance.

Authors:V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Title: GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Abstract:
We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving results superior to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.
中文: GLM-4.1V-Thinking和GLM-4.5V视觉语言模型通过以推理为核心的训练框架和课程采样强化学习,在42个基准测试中取得顶尖性能,超越同类开源模型并与主流闭源模型相媲美。
English: The GLM-4.1V-Thinking and GLM-4.5V vision-language models achieve state-of-the-art performance across 42 benchmarks through a reasoning-centric training framework and Reinforcement Learning with Curriculum Sampling, surpassing similar-sized open-source models and competing with leading closed-source models.

Authors:Jack Nugent, Siyang Wu, Zeyu Ma, Beining Han, Meenal Parakh, Abhishek Joshi, Lingjie Mei, Alexander Raistrick, Xinyuan Li, Jia Deng
Title: Evaluating Robustness of Monocular Depth Estimation with Procedural Scene Perturbations
Abstract:
Recent years have witnessed substantial progress on monocular depth estimation, particularly as measured by the success of large models on standard benchmarks. However, performance on standard benchmarks does not offer a complete assessment, because most evaluate accuracy but not robustness. In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic robustness evaluation. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research. Code and data are available at https://github.com/princeton-vl/proc-depth-eval.
中文: 本文提出PDE基准测试,通过程序化生成三维场景来系统评估单目深度估计模型对各种受控干扰的鲁棒性,揭示了当前先进方法面临的挑战性场景。
English: This paper introduces PDE, a procedurally generated benchmark for systematically evaluating the robustness of monocular depth estimation models to various controlled perturbations, revealing challenging scenarios for state-of-the-art methods.

Authors:Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li
Title: RTMap: Real-Time Recursive Mapping with Change Detection and Localization
Abstract:
While recent online HD mapping methods relieve the burden of offline pipelines and address map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior-map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance in both prior-aided map quality and localization accuracy, showing that RTMap robustly serves downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior-map asynchronously. Our source code will be made publicly available at https://github.com/CN-ADLab/RTMap.
中文: RTMap通过众包多轨迹高清地图作为自进化记忆,以不确定性建模、概率定位和实时变化检测增强单次轨迹建图,有效解决感知误差、遮挡及多智能体融合问题,显著提升地图质量与定位精度。
English: RTMap enhances single-traversal HD mapping by crowdsourcing a multi-traversal map as a self-evolving memory, addressing perceptual inaccuracies, occlusion, and multi-agent fusion through uncertainty-aware modeling, probabilistic localization, and real-time change detection to improve map quality and localization accuracy.

Authors:Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie
Title: ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
Abstract:
Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
中文摘要:本文提出的ONLY方法是一种无需训练的解码方案,通过文本-视觉熵比选择性增强关键文本信息,有效减少大型视觉语言模型的幻觉问题,能以最小计算成本实现实时部署。
English Summary: The proposed ONLY method is a training-free decoding approach that efficiently reduces hallucinations in Large Vision-Language Models by amplifying crucial textual information using a text-to-visual entropy ratio, enabling real-time deployment with minimal computational cost.
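
To make the decoding idea above concrete, here is a heavily simplified, hypothetical sketch of a one-layer intervention driven by a text-to-visual entropy ratio; the tensor names, the tanh scaling rule, and alpha are assumptions for illustration, not the authors' implementation:

    import torch

    def entropy(p, eps=1e-9):
        # Shannon entropy of attention distributions along the last axis.
        return -(p * (p + eps).log()).sum(-1)

    def entropy_ratio_intervention(hidden, attn_text, attn_vis, alpha=0.5):
        # hidden: (tokens, dim); attn_text / attn_vis: each token's attention
        # over text and visual positions (rows sum to 1). Tokens whose
        # textual attention is comparatively focused (low entropy) get
        # amplified; only a single layer's hidden states are modified.
        ratio = entropy(attn_text) / (entropy(attn_vis) + 1e-9)
        scale = 1.0 + alpha * torch.tanh(1.0 - ratio)
        return hidden * scale.unsqueeze(-1)

    hidden = torch.randn(5, 16)
    attn_text = torch.softmax(torch.randn(5, 7), dim=-1)
    attn_vis = torch.softmax(torch.randn(5, 9), dim=-1)
    out = entropy_ratio_intervention(hidden, attn_text, attn_vis)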

Authors:Wei Li, Jiaman Tang, Yang Li, Beihao Xia, Ligang Tan, Hongmao Qin
Title: UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection
Abstract:
Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.
Chinese: 提出的UAVD-Mamba框架利用Mamba架构解决无人机目标检测中的遮挡和小物体等挑战,通过引入可变形令牌块和多模态融合,在DroneVehicle数据集上比基线方法mAP指标提升了3.6%。
English: The proposed UAVD-Mamba framework leverages Mamba architectures to address challenges in UAV object detection, such as occlusions and small objects, by introducing deformable token blocks and multimodal fusion, achieving a 3.6% mAP improvement over the baseline on the DroneVehicle dataset.

Authors:Yasser El Jarida, Youssef Iraqi, Loubna Mekouar
Title: Instant Particle Size Distribution Measurement Using CNNs Trained on Synthetic Data
Abstract:
Accurate particle size distribution (PSD) measurement is important in industries such as mining, pharmaceuticals, and fertilizer manufacturing, significantly influencing product quality and operational efficiency. Traditional PSD methods like sieve analysis and laser diffraction are manual, time-consuming, and limited by particle overlap. Recent developments in convolutional neural networks (CNNs) enable automated, real-time PSD estimation directly from particle images. In this work, we present a CNN-based methodology trained on realistic synthetic particle imagery generated using Blender's advanced rendering capabilities. Synthetic datasets generated using this method can replicate various industrial scenarios by systematically varying particle shapes, textures, lighting, and spatial arrangements that closely resemble the actual configurations. We evaluated three CNN-based architectures, ResNet-50, InceptionV3, and EfficientNet-B0, for predicting critical PSD parameters (d10, d50, d90). Results demonstrated comparable accuracy across models, with EfficientNet-B0 achieving the best computational efficiency suitable for real-time industrial deployment. This approach shows the effectiveness of realistic synthetic data for robust CNN training, which offers significant potential for automated industrial PSD monitoring. The code is released at https://github.com/YasserElj/Synthetic-Granular-Gen
中文摘要:本研究提出基于卷积神经网络的方法,利用逼真合成颗粒图像实现粒度分布的自动化测量,其中EfficientNet-B0模型展现出最佳计算效率,适用于工业实时监测。
English Summary: This study introduces a CNN-based method using realistic synthetic particle images to automate particle size distribution measurement, with EfficientNet-B0 showing optimal computational efficiency for industrial deployment.
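
The target quantities d10/d50/d90 mentioned above are simply percentiles of the particle size distribution. A toy computation (count-weighted here; industrial PSD is often volume-weighted, which would weight each particle by the cube of its diameter):

    import numpy as np

    def psd_percentiles(diameters):
        # d10/d50/d90: diameters below which 10%, 50%, and 90% of
        # particles fall (by count in this simplified version).
        return np.percentile(diameters, [10, 50, 90])

    diameters_um = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)
    d10, d50, d90 = psd_percentiles(diameters_um)
    print(f"d10={d10:.1f} um, d50={d50:.1f} um, d90={d90:.1f} um")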

Authors:Minye Shao, Xingyu Miao, Haoran Duan, Zeyu Wang, Jingkun Chen, Yawen Huang, Xian Wu, Jingjing Deng, Yang Long, Yefeng Zheng
Title: TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency
Abstract:
3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: https://github.com/VinyehShaw/TRACE.
中文: TRACE框架采用二维多模态条件扩散方法生成时空与解剖结构对齐的三维医学图像,在保证计算效率的同时有效解决了现有方法在解剖保真度和生成长度方面的局限性。
English: TRACE is a computationally efficient framework that generates spatiotemporally and anatomically aligned 3D medical images using a 2D multimodal-conditioned diffusion approach, overcoming limitations in fidelity and scalability of current methods.

Authors:Hendric Voss, Stefan Kopp
Title: JAX-IK: Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters
Abstract:
Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of JAX, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. We demonstrate the solver's effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative IK algorithms, such as Cyclic Coordinate Descent (CCD), FABRIK, and the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at https://github.com/hvoss-techfak/JAX-IK
中文: 本文提出了一种新颖的实时逆运动学求解器,利用JAX的自动微分和即时编译技术高效生成逼真人体运动,在收敛速度和成功率方面优于现有方法。
English: This paper presents a novel real-time inverse kinematics solver that uses JAX's automatic differentiation and just-in-time compilation to generate realistic human movements efficiently, outperforming existing methods in convergence speed and success rates.
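
The key idea above, treating forward kinematics as a differentiable operation and solving IK by gradient-based optimization, can be shown on a toy planar 2-link arm. This sketch uses PyTorch autograd purely for illustration; the paper's solver targets full human skeletons with joint limits:

    import torch

    def fk_2link(theta, lengths=(1.0, 1.0)):
        # Differentiable forward kinematics of a planar 2-link arm:
        # returns the end-effector (x, y) position.
        a, b = theta[0], theta[0] + theta[1]
        x = lengths[0] * torch.cos(a) + lengths[1] * torch.cos(b)
        y = lengths[0] * torch.sin(a) + lengths[1] * torch.sin(b)
        return torch.stack([x, y])

    target = torch.tensor([1.2, 0.8])
    theta = torch.tensor([0.3, 0.3], requires_grad=True)  # initial joint angles
    opt = torch.optim.Adam([theta], lr=0.05)
    for _ in range(300):
        opt.zero_grad()
        loss = ((fk_2link(theta) - target) ** 2).sum()
        loss.backward()  # gradients through FK via automatic differentiation
        opt.step()
    print(fk_2link(theta).detach(), target)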

Authors:Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu
Title: LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
Abstract:
Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data are available at https://github.com/AMAP-ML/LD-RPS.
中文: 本文提出了一种新颖、无需数据集的统一图像恢复方法,利用预训练的潜在扩散模型,结合多模态理解提供语义先验,并通过循环细化在任务无关条件下超越现有最优方法。
English: This paper introduces a novel, dataset-free unified image restoration method using a pretrained latent diffusion model, which incorporates multimodal understanding for semantic priors and recurrent refinement to outperform state-of-the-art approaches.

Authors:Xiao Zhang, Fei Wei, Yong Wang, Wenda Zhao, Feiyi Li, Xiangxiang Chu
Title: UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
Abstract:
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.
中文摘要:提出的UPRE框架通过联合优化文本提示和视觉表征,克服了零样本领域自适应中的关键挑战,在九个基准数据集上展现出卓越性能。
English Summary: The proposed UPRE framework overcomes limitations in zero-shot domain adaptation by jointly optimizing textual prompts and visual representations, achieving superior performance across nine benchmark datasets.

Authors:Zeming Chen, Hang Zhao
Title: BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving
Abstract:
Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder for a compact and unified BEV latent space and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.
Chinese: BEV-VAE提出了一种创新的自动驾驶多视角图像生成方法,通过统一的BEV潜在空间和潜在扩散变换器,确保跨相机视角的3D一致性和可控性。
English: BEV-VAE introduces a novel approach for multi-view image generation in autonomous driving by leveraging a unified BEV latent space and a latent diffusion transformer to ensure 3D consistency and controllability across camera views.

Authors:Qihang Fan, Huaibo Huang, Yuang Ai, Ran He
Title: Rectifying Magnitude Neglect in Linear Attention
Abstract:
As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
中文: 本文提出的幅度感知线性注意力(MALA)通过引入查询向量幅值计算,使线性注意力能生成接近Softmax注意力的分布,在多项任务中取得优异性能。
English: This paper introduces Magnitude-Aware Linear Attention (MALA), which addresses Linear Attention's performance gap by incorporating Query magnitude to mimic Softmax Attention's distribution, achieving strong results across diverse tasks.
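
The "magnitude neglect" identified above is easy to verify numerically: with a positively homogeneous feature map (e.g., ReLU) and the usual normalization, rescaling the Query leaves linear attention's output unchanged, while softmax attention sharpens. A small demo (MALA's actual remedy is not reproduced here):

    import numpy as np

    def linear_attn(q, k, v):
        # Kernelized linear attention with ReLU feature map phi:
        # out_i = sum_j phi(q_i).phi(k_j) v_j / sum_j phi(q_i).phi(k_j)
        s = np.maximum(q, 0) @ np.maximum(k, 0).T
        return (s @ v) / (s.sum(-1, keepdims=True) + 1e-9)

    def softmax_attn(q, k, v):
        s = q @ k.T
        p = np.exp(s - s.max(-1, keepdims=True))
        return (p / p.sum(-1, keepdims=True)) @ v

    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
    # Scaling the Query by any c > 0 cancels out in linear attention...
    assert np.allclose(linear_attn(q, k, v), linear_attn(5.0 * q, k, v))
    # ...but changes the softmax attention distribution (it sharpens).
    assert not np.allclose(softmax_attn(q, k, v), softmax_attn(5.0 * q, k, v))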

Authors:Jan Nikolas Morshuis, Christian Schlarmann, Thomas Küstner, Christian F. Baumgartner, Matthias Hein
Title: Mind the Detail: Uncovering Clinically Relevant Image Details in Accelerated MRI with Semantically Diverse Reconstructions
Abstract:
In recent years, accelerated MRI reconstruction based on deep learning has led to significant improvements in image quality with impressive results for high acceleration factors. However, from a clinical perspective image quality is only secondary; much more important is that all clinically relevant information is preserved in the reconstruction from heavily undersampled data. In this paper, we show that existing techniques, even when considering resampling for diffusion-based reconstruction, can fail to reconstruct small and rare pathologies, thus leading to potentially wrong diagnosis decisions (false negatives). To uncover the potentially missing clinical information we propose "Semantically Diverse Reconstructions" (SDR), a method which, given an original reconstruction, generates novel reconstructions with enhanced semantic variability while all of them are fully consistent with the measured data. To evaluate SDR automatically we train an object detector on the fastMRI+ dataset. We show that SDR significantly reduces the chance of false-negative diagnoses (higher recall) and improves mean average precision compared to the original reconstructions. The code is available at https://github.com/NikolasMorshuis/SDR
中文摘要:深度学习加速的MRI重建可能遗漏微小或罕见病变,因此提出的语义多样性重建(SDR)方法通过生成多种重建结果来降低漏诊率并提升诊断精确度。
English Summary: Deep learning-based accelerated MRI reconstruction may miss small or rare pathologies, so the proposed Semantically Diverse Reconstructions (SDR) method generates multiple reconstructions to reduce false negatives and improve diagnostic accuracy.
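
The constraint above, that every generated reconstruction stays fully consistent with the measured data, corresponds to a k-space data-consistency projection. A single-coil toy version (real MRI pipelines use multi-coil data and more careful weighting):

    import numpy as np

    def data_consistency(x, y_measured, mask):
        # Replace the reconstruction's k-space values at sampled
        # locations with the actually measured ones; keep the rest.
        kx = np.fft.fft2(x)
        kx[mask] = y_measured[mask]
        return np.real(np.fft.ifft2(kx))

    x = np.random.rand(128, 128)               # candidate reconstruction
    mask = np.random.rand(128, 128) < 0.25     # ~4x undersampling pattern
    y = np.fft.fft2(np.random.rand(128, 128))  # stand-in measured k-space
    x_dc = data_consistency(x, y, mask)        # consistent with measurements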

Authors:Rusi Chen, Yuanting Yang, Jiezhi Yao, Hongning Song, Ji Zhang, Yongsong Zhou, Yuhao Huang, Ronghao Yang, Dan Jia, Yuhan Zhang, Xing Tao, Haoran Dou, Qing Zhou, Xin Yang, Dong Ni
Title: MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentation in 4D Ultrasound
Abstract:
Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Moreover, the absence of inter-phase dependency modeling in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per-phase and inter-phase. Second, we devise a novel topology-guided correlation regularization that exploits physical prior knowledge to keep predictions anatomically plausible. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the largest 4D MV dataset to date, with 1408 phases from 160 patients, show that MTCNet achieves superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75mm). Both the code and the dataset are available at https://github.com/crs524/MTCNet.
中文: 提出的运动-拓扑引导一致性网络(MTCNet)通过跨相位运动一致性学习和拓扑正则化方法,在仅需稀疏标注的情况下实现了对四维二尖瓣超声的精准分割,并在大规模临床数据上展现出优越性能。
English: The proposed Motion-Topology guided consistency network (MTCNet) addresses challenges in 4D mitral valve ultrasound segmentation by leveraging cross-phase motion consistency and topology regularization, achieving superior performance with only sparse annotations on a large clinical dataset.

Authors:Siyuan Yao, Rui Zhu, Ziqi Wang, Wenqi Ren, Yanyang Yan, Xiaochun Cao
Title: UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
Abstract:
Visual object tracking has gained promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for unconstrained real-world scenarios with adverse weather conditions, e.g., nighttime or foggy environments, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% of the frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) following the optimal transport theorem. Extensive experiments demonstrate that UMDATrack surpasses existing advanced visual trackers and sets new state-of-the-art performance by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.
中文: UMDATrack提出了一种统一域适应框架,通过可控场景生成器、域定制适配器和目标感知置信度对齐模块,在多种恶劣天气条件下保持高质量视觉目标跟踪,实现了最先进的性能。
English: UMDATrack introduces a unified domain adaptation framework that maintains high-quality visual object tracking across adverse weather conditions through a controllable scenario generator, domain-customized adapter, and target-aware confidence alignment module, achieving state-of-the-art performance.

Authors:Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, Dongbin Zhao
Title: World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
Abstract:
End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75x faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.
中文: World4Drive是一种端到端自动驾驶框架,利用视觉基础模型构建潜在世界模型来生成和评估多模态规划轨迹,通过自监督学习无需人工感知标注即可实现最先进的性能。
English: World4Drive is an end-to-end autonomous driving framework that uses vision foundation models to create latent world models for generating and evaluating multi-modal planning trajectories, achieving state-of-the-art performance without manual perception annotations through self-supervised learning.

Authors:Luming Zhao, Jingwen Xuan, Jiamin Lou, Yonghui Yu, Wenwu Yang
Title: Context-Aware Academic Emotion Dataset and Benchmark
Abstract:
Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues, such as whether a student is looking at a phone or reading a book, alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions. Project page: https://zgsfer.github.io/CAER

Authors:Hao Tang, Zhiqing Guo, Liejun Wang, Chao Liu
Title: Similarity Memory Prior is All You Need for Medical Image Segmentation
Abstract:
In recent years, it has been found that "grandmother cells" in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through the Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network to directly extract category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and Euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available at https://github.com/vpsg-research/Sim-MPNet.
中文:受猕猴视觉皮层"祖母细胞"启发,本文提出Sim-MPNet模型,通过DMW-LA和DS-GIM模块实现医学图像精准分割,在多个公开数据集上取得领先性能。
English: Inspired by "grandmother cells" in macaques' visual cortex, this paper introduces Sim-MPNet with DMW-LA and DS-GIM modules for enhanced medical image segmentation, achieving state-of-the-art results on multiple datasets.

Authors:Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, Fei Liu
Title: Zero-Shot Skeleton-Based Action Recognition With Prototype-Guided Feature Alignment
Abstract:
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.
中文: 本文提出PGFA原型引导特征对齐框架,通过端到端跨模态对比训练和原型引导策略增强骨架-文本对齐并减少分布差异,在多个数据集上显著提升了零样本骨架动作识别的准确率。
English: This paper introduces PGFA, an end-to-end prototype-guided feature alignment framework that enhances skeleton-text alignment and mitigates distribution discrepancies to significantly improve zero-shot skeleton-based action recognition accuracy across multiple datasets.
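
A hypothetical sketch of the prototype-guided alignment direction described above: compute per-class skeleton prototypes on seen classes, estimate the skeleton-to-text offset from them, and shift text features before nearest-neighbor classification. The offset rule and all names are illustrative assumptions, not the paper's exact strategy:

    import numpy as np

    def class_prototypes(feats, labels, classes):
        # Mean skeleton feature per seen class.
        return np.stack([feats[labels == c].mean(axis=0) for c in classes])

    def align_and_classify(skel_feat, text_feats, protos, seen_ids):
        # Shift all text features by the mean seen-class offset between
        # skeleton prototypes and their text embeddings, then classify
        # by cosine similarity to the aligned text features.
        offset = (protos - text_feats[seen_ids]).mean(axis=0)
        aligned = text_feats + offset
        a = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
        s = skel_feat / np.linalg.norm(skel_feat)
        return int(np.argmax(a @ s))

    feats = np.random.rand(100, 32)
    labels = np.random.randint(0, 5, 100)
    protos = class_prototypes(feats, labels, classes=range(5))
    text_feats = np.random.rand(8, 32)  # 5 seen + 3 unseen classes
    pred = align_and_classify(np.random.rand(32), text_feats, protos,
                              seen_ids=np.arange(5))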

Authors:Djamahl Etchegaray, Yuxia Fu, Zi Huang, Yadan Luo
Title: Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving
Abstract:
Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-sourced fine-grained object classes and visual attributes that reflect the complexity that drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantees dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried about perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at https://djamahl99.github.io/qaymo-pages/.
中文: 本文提出Box-QAymo数据集和基准,通过用户指定的边界框评估并增强视觉语言模型的时空推理能力,以解决当前自动驾驶可解释通信中的实际缺陷。
English: This paper introduces Box-QAymo, a dataset and benchmark that evaluates and enhances vision-language models' spatial-temporal reasoning through user-specified bounding boxes, addressing current limitations in interpretable communication for autonomous driving.

Authors:Ruize Cui, Jiaan Zhang, Jialun Pei, Kai Wang, Pheng-Ann Heng, Jing Qin
Title: Topology-Constrained Learning for Efficient Laparoscopic Liver Landmark Detection
Abstract:
Liver landmarks provide crucial anatomical guidance to the surgeon during laparoscopic liver surgery to minimize surgical risk. However, the tubular structural properties of landmarks and dynamic intraoperative deformations pose significant challenges for automatic landmark detection. In this study, we introduce TopoNet, a novel topology-constrained learning framework for laparoscopic liver landmark detection. Our framework adopts a snake-CNN dual-path encoder to simultaneously capture detailed RGB texture information and depth-informed topological structures. Meanwhile, we propose a boundary-aware topology fusion (BTF) module, which adaptively merges RGB-D features to enhance edge perception while preserving global topology. Additionally, a topological constraint loss function is embedded, which contains a center-line constraint loss and a topological persistence loss to ensure homotopy equivalence between predictions and labels. Extensive experiments on the L3D and P2ILF datasets demonstrate that TopoNet achieves outstanding accuracy with low computational complexity, highlighting its potential for clinical applications in laparoscopic liver surgery. Our code will be available at https://github.com/cuiruize/TopoNet.
Chinese: TopoNet是一种新颖的拓扑约束学习框架,通过自适应特征融合和专用损失函数整合RGB纹理与深度拓扑结构,显著提升了腹腔镜肝脏标志物检测的精度与效率,在临床数据集中表现卓越。
English: TopoNet is a novel topology-constrained learning framework that enhances laparoscopic liver landmark detection by integrating RGB texture and depth-informed topological structures through adaptive feature fusion and specialized loss functions, demonstrating superior accuracy and efficiency in clinical datasets.

Authors:Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, Xinliang Wang
Title: LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
Abstract:
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at https://github.com/CnFaker/LLaVA-SP.
Chinese: 提出的LLaVA-SP模型通过新型投影器添加空间视觉标记,显著提升了多模态模型的视觉表征能力,在多项基准测试中超越LLaVA-1.5且推理延迟几乎不变。
English: The proposed LLaVA-SP model enhances multimodal language models by adding spatial visual tokens through a novel projector, significantly improving visual representation and outperforming LLaVA-1.5 across various benchmarks with minimal latency increase.

Authors:Yongzhen Wang, Liangliang Chen, Bingwen Hu, Heng Liu, Xiao-Ping Zhang, Mingqiang Wei
Title: Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing
Abstract:
Recent progress in image restoration has underscored State Space Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal recovery of fine image features. To tackle these challenges, we introduce Laplace-Mamba, a novel framework that integrates a Laplace frequency prior with a hybrid Mamba-CNN architecture for efficient image dehazing. Leveraging the Laplace decomposition, the image is disentangled into low-frequency components capturing global texture and high-frequency components representing edges and fine details. This decomposition enables specialized processing via dual parallel pathways: the low-frequency branch employs SSMs for global context modeling, while the high-frequency branch utilizes CNNs to refine local structural details, effectively addressing diverse haze scenarios. Notably, the Laplace transformation facilitates information-preserving downsampling of low-frequency components in accordance with the Nyquist theory, thereby significantly improving computational efficiency. Extensive evaluations across multiple benchmarks demonstrate that our method outperforms state-of-the-art approaches in both restoration quality and efficiency. The source code and pretrained models are available at https://github.com/yz-wang/Laplace-Mamba.
中文摘要:提出的Laplace-Mamba框架通过拉普拉斯频率分解与混合Mamba-CNN架构相结合,利用并行通路分别处理全局纹理和局部细节,有效解决图像去雾问题,在恢复质量和效率方面均优于现有方法。
English Summary: The proposed Laplace-Mamba framework combines Laplace frequency decomposition with a hybrid Mamba-CNN architecture to effectively address image dehazing by processing global textures and local details through parallel pathways, achieving superior restoration quality and efficiency.
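
A minimal sketch of the frequency split that drives the dual-path design above: a blurred, downsampled low-frequency band (for the SSM branch) and a full-resolution high-frequency residual (for the CNN branch). Filter choices are illustrative; the paper's Laplace decomposition details may differ:

    import numpy as np
    from scipy.ndimage import gaussian_filter, zoom

    def laplace_split(img, sigma=2.0):
        # Low band: Gaussian blur then 2x downsample; for band-limited
        # content this is information-preserving in the Nyquist sense.
        low = gaussian_filter(img, sigma)[::2, ::2]
        # High band: residual between the image and the upsampled low
        # band (edges and fine details).
        up = zoom(low, 2.0, order=1)[: img.shape[0], : img.shape[1]]
        return low, img - up

    img = np.random.rand(128, 128)
    low, high = laplace_split(img)  # low -> SSM branch, high -> CNN branch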

Authors:Zijian Chen, Yuan Tian, Yuze Sun, Wei Sun, Zicheng Zhang, Weisi Lin, Guangtao Zhai, Wenjun Zhang
Title: Just Noticeable Difference for Large Multimodal Models
Abstract:
Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs are not investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we take an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systematically quantify this characteristic, we propose a new concept, LMM-JND, together with its determination pipeline. To uncover behavioral commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation between their design philosophy that may instruct the future refinement of LMMs for their visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and predictable LMM-JND is crucial for security concerns. This work will be available at https://github.com/zijianchen98/LMM-JND.
中文: 本研究提出LMM-JND概念系统量化大型多模态模型的视觉盲区,通过构建大规模数据集和分析,揭示了当前先进模型在视觉感知能力上与人类存在的显著差距。
English: This study introduces LMM-JND to systematically quantify visual blind spots in large multimodal models (LMMs), revealing their significant perceptual limitations compared to humans through a comprehensive dataset and analysis.
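
Conceptually, determining an LMM-JND reduces to increasing distortion strength until the model's answer to a fixed comparison query first flips. A toy, self-contained version with a stand-in "model" (the real pipeline queries an LMM such as GPT-4o; everything below is illustrative):

    import numpy as np

    def toy_model_same(a, b, tol=0.02):
        # Stand-in perceiver: calls two images "the same" while the
        # mean absolute difference stays under a threshold.
        return float(np.mean(np.abs(a - b))) < tol

    def add_noise(img, level, rng=np.random.default_rng(0)):
        return np.clip(img + rng.normal(0.0, level, img.shape), 0.0, 1.0)

    def find_jnd(same_fn, distort, image, levels):
        # Smallest distortion level the model notices; None means a
        # blind spot within the tested range.
        for level in levels:
            if not same_fn(image, distort(image, level)):
                return level
        return None

    img = np.random.rand(64, 64)
    print(find_jnd(toy_model_same, add_noise, img,
                   [0.005 * i for i in range(1, 40)]))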

Authors:Yaofei Duan, Yuhao Huang, Xin Yang, Luyi Han, Xinyu Xie, Zhiyuan Zhu, Ping He, Ka-Hou Chan, Ligang Cui, Sio-Kei Im, Dong Ni, Tao Tan
Title: ADAptation: Reconstruction-based Unsupervised Active Learning for Breast Ultrasound Diagnosis
Abstract:
Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: https://github.com/miccai25-966/ADAptation.
中文: 本研究提出名为ADAptation的无监督主动学习框架,通过扩散模型实现风格迁移,结合对比学习和双评分机制,在有限标注预算下高效选择信息样本,有效解决诊断模型因域偏移导致的性能下降问题,在多个乳腺超声数据集中展现出优越性能。
English: This study introduces ADAptation, an unsupervised active learning framework that addresses performance degradation in diagnostic models due to domain shifts by using diffusion models for style translation and incorporating contrastive learning with a dual-scoring mechanism to efficiently select informative samples under limited annotation budgets, demonstrating superior performance across multiple breast ultrasound datasets.
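
A hedged sketch of the dual-scoring idea described above: combine predictive-entropy uncertainty with representativeness (closeness to a cluster centroid) and pick the top-budget unlabeled samples. The weighting, cluster count, and scoring details are illustrative assumptions, not the paper's settings:

    import numpy as np
    from sklearn.cluster import KMeans

    def dual_score_select(probs, feats, budget, n_clusters=10, lam=0.5):
        # Uncertainty: entropy of the classifier's predictive distribution.
        uncertainty = -(probs * np.log(probs + 1e-9)).sum(axis=1)
        # Representativeness: inverse distance to the cluster centroid.
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
        dist = np.linalg.norm(feats - km.cluster_centers_[km.labels_], axis=1)
        representativeness = 1.0 / (1.0 + dist)
        score = lam * uncertainty + (1.0 - lam) * representativeness
        return np.argsort(-score)[:budget]  # indices to send for annotation

    probs = np.random.dirichlet(np.ones(2), size=500)  # e.g., benign vs. malignant
    feats = np.random.rand(500, 64)
    picked = dual_score_select(probs, feats, budget=50)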

Authors:Ying Guo, Xi Liu, Cheng Zhen, Pengfei Yan, Xiaoming Wei
Title: ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
Abstract:
Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigms and explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making real-time, realistic generation challenging. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.

Authors:Xin Luo, Menglin Zhang, Yunwei Lan, Tianyu Zhang, Rui Li, Chang Liu, Dong Liu
Title: Latent Posterior-Mean Rectified Flow for Higher-Fidelity Perceptual Face Restoration
Abstract:
The Perception-Distortion tradeoff (PD-tradeoff) theory suggests that face restoration algorithms must balance perceptual quality and fidelity. To achieve minimal distortion while maintaining perfect perceptual quality, Posterior-Mean Rectified Flow (PMRF) proposes a flow-based approach whose source distribution consists of minimum-distortion estimates. Although PMRF is shown to be effective, its pixel-space modeling limits its ability to align with human perception, where human perception is defined as how humans distinguish between two image distributions. In this work, we propose Latent-PMRF, which reformulates PMRF in the latent space of a variational autoencoder (VAE), facilitating better alignment with human perception during optimization. By defining the source distribution on latent representations of the minimum-distortion estimate, we bound the minimum distortion by the VAE's reconstruction error. Moreover, we reveal that the design of the VAE is crucial, and our proposed VAE significantly outperforms existing VAEs in both reconstruction and restoration. Extensive experiments on blind face restoration demonstrate the superiority of Latent-PMRF, offering an improved PD-tradeoff compared to existing methods, along with remarkable convergence efficiency, achieving a 5.79X speedup over PMRF in terms of FID. Our code will be available as open-source.
Chinese: Latent-PMRF reformulates Posterior-Mean Rectified Flow in the latent space of a variational autoencoder, markedly improving the perception-distortion balance of face restoration and converging 5.79X faster than the existing method.
English: Latent-PMRF enhances face restoration by reformulating the Posterior-Mean Rectified Flow in a variational autoencoder's latent space, achieving superior perceptual-distortion tradeoff and a 5.79X speedup in convergence efficiency over prior methods.
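The following is a minimal sketch of what rectified-flow sampling in a VAE latent space can look like: Euler integration of a learned velocity field starting from the latent of a minimum-distortion (posterior-mean) estimate. VelocityField, latent_rectified_flow, and the step count are illustrative assumptions; the paper's actual networks and VAE are not reproduced.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy stand-in for a learned rectified-flow velocity network."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 512), nn.SiLU(), nn.Linear(512, dim))

    def forward(self, z, t):
        return self.net(torch.cat([z, t.expand(z.shape[0], 1)], dim=-1))

@torch.no_grad()
def latent_rectified_flow(v, z_mmse, steps=25):
    """Euler integration of dz/dt = v(z, t) from t=0 (posterior-mean
    latent) to t=1 (restored latent); decode with the VAE afterwards."""
    z, dt = z_mmse.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        z = z + dt * v(z, t)
    return z

v = VelocityField()
z_hat = latent_rectified_flow(v, torch.randn(4, 256))  # 4 latent vectors
print(z_hat.shape)  # torch.Size([4, 256])
```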

Authors:Huanxin Yang, Qiwen Wang
Title: MFH: Marrying Frequency Domain with Handwritten Mathematical Expression Recognition
Abstract:
Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries the frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the assistance that frequency information provides to the structural analysis of mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency-domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracy rates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at https://github.com/Hryxyhe/MFH.
Chinese: This paper proposes MFH, which introduces frequency-domain analysis via the discrete cosine transform to enhance handwritten mathematical expression recognition, effectively strengthening structural analysis, delivering performance gains on a variety of baseline models, and achieving notable accuracy on the CROHME datasets.
English: This paper introduces MFH, a method that integrates frequency domain analysis using discrete cosine transform to enhance handwritten mathematical expression recognition by improving structural analysis, which consistently boosts performance across baseline models and achieves high accuracy on CROHME datasets.
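For readers unfamiliar with the transform, the sketch below extracts low-frequency structural features with the 2D discrete cosine transform via SciPy. The helper dct_lowfreq_features and the 8x8 coefficient block are assumptions for illustration; how MFH actually fuses frequency features into the recognizer is not reproduced here.

```python
import numpy as np
from scipy.fft import dctn

def dct_lowfreq_features(image, k=8):
    """Keep the top-left k x k block of 2D DCT coefficients, where most
    of the low-frequency (structural) energy concentrates."""
    coeffs = dctn(image.astype(np.float64), norm="ortho")
    return coeffs[:k, :k]

img = np.random.rand(64, 64)            # stand-in for a formula image patch
print(dct_lowfreq_features(img).shape)  # (8, 8)
```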

Authors:Jingyi Pan, Dan Xu, Qiong Luo
Title: DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting
Abstract:
Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) single-reference inpainting methods lack robustness when dealing with views that are far from the reference view; 2) appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; and 3) geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. The project page is available at https://rorisis.github.io/DiGA3D/.
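One plausible way to select multiple spread-out reference views is farthest-point sampling over camera centers, sketched below. This is only a guess at the flavor of such a strategy; select_reference_views is a hypothetical helper, and DiGA3D's actual selection criterion may differ.

```python
import numpy as np

def select_reference_views(cam_centers, k):
    """cam_centers: (N, 3) camera positions; returns k spread-out indices."""
    chosen = [0]                                   # seed with an arbitrary view
    dists = np.linalg.norm(cam_centers - cam_centers[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))                # farthest from the chosen set
        chosen.append(idx)
        new_d = np.linalg.norm(cam_centers - cam_centers[idx], axis=1)
        dists = np.minimum(dists, new_d)           # distance to nearest reference
    return chosen

centers = np.random.rand(40, 3) * 5.0              # 40 dummy camera positions
print(select_reference_views(centers, k=4))
```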

Authors:Xin Xu, Eibe Frank, Geoffrey Holmes
Title: Few-shot Classification as Multi-instance Verification: Effective Backbone-agnostic Transfer across Domains
Abstract:
We investigate cross-domain few-shot learning under the constraint that fine-tuning of backbones (i.e., feature extractors) is impossible or infeasible -- a scenario that is increasingly common in practical use cases. Handling the low-quality, static embeddings produced by frozen, "black-box" backbones leads us to represent few-shot classification as a series of multiple instance verification (MIV) tasks. Inspired by this representation, we introduce a novel approach to few-shot domain adaptation, named the "MIV-head", akin to a classification head that is agnostic to any pretrained backbone and computationally efficient. The core components designed for the MIV-head, when trained on few-shot data from a target domain, collectively yield strong performance on test data from that domain. Importantly, it does so without fine-tuning the backbone, and within the "meta-testing" phase. Experimenting under various settings, on an extension of the Meta-dataset benchmark for cross-domain few-shot image classification, and using representative off-the-shelf convolutional neural network and vision transformer backbones pretrained on ImageNet1K, we show that the MIV-head achieves highly competitive accuracy compared to state-of-the-art "adapter" (or partially fine-tuning) methods applied to the same backbones, while incurring a substantially lower adaptation cost. We also find that well-known "classification head" approaches lag far behind in terms of accuracy. An ablation study empirically justifies the core components of our approach. We share our code at https://github.com/xxweka/MIV-head.
Chinese: This study proposes the MIV-head, which treats cross-domain few-shot learning as multiple instance verification tasks, matching the accuracy of state-of-the-art methods without fine-tuning the backbone network and at a markedly lower adaptation cost.
English: This study introduces the MIV-head, a novel approach for cross-domain few-shot learning that treats classification as multiple instance verification tasks, achieving competitive accuracy without fine-tuning backbones and at a lower adaptation cost compared to state-of-the-art methods.
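As a toy illustration of the verification view over frozen embeddings, the sketch below treats each class's support instances as a bag and verifies a query against every bag via max cosine similarity. The aggregation rule and the miv_classify helper are simplifications, not the trained MIV-head.

```python
import numpy as np

def miv_classify(query, support_feats, support_labels):
    """query: (D,); support_feats: (N, D); all L2-normalized."""
    scores = {}
    for c in np.unique(support_labels):
        bag = support_feats[support_labels == c]   # the class-c instance bag
        scores[c] = float((bag @ query).max())     # verify query against the bag
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
sup = rng.normal(size=(25, 64))
sup /= np.linalg.norm(sup, axis=1, keepdims=True)
lab = np.repeat(np.arange(5), 5)                   # a 5-way 5-shot episode
q = sup[3] + 0.1 * rng.normal(size=64)             # query near a class-0 instance
q /= np.linalg.norm(q)
print(miv_classify(q, sup, lab))                   # expected: 0
```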

Authors:Jian Wang, Qiongying Ni, Hongkui Yu, Ruixuan Yao, Jinqiao Ying, Bin Zhang, Xingyi Yang, Jin Peng, Jiongquan Chen, Junxuan Yu, Wenlong Shi, Chaoyu Chen, Zhongnuo Yan, Mingyuan Luo, Gaocheng Cai, Dong Ni, Jing Lu, Xin Yang
Title: Accurate and Efficient Fetal Birth Weight Estimation from 3D Ultrasound
Abstract:
Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of $166.4\pm155.9$ g and a mean absolute percentage error of $5.1\pm4.6$%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: https://github.com/Qioy-i/EFW.
Chinese: This study is the first to propose estimating fetal birth weight directly from 3D ultrasound volumes, combining a multi-scale feature fusion network with a synthetic-sample learning framework to achieve prediction accuracy that surpasses existing methods and approaches expert level.
English: This study introduces the first method for directly estimating fetal birth weight from 3D ultrasound volumes using a multi-scale feature fusion network and synthetic sample learning, achieving superior accuracy that rivals expert assessments.
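The abstract mentions a ranking-based loss under sparse supervision; below is a generic pairwise hinge formulation of such a loss. The ranking_loss function and its margin are illustrative and may differ from the objective actually used in the MFFN.

```python
import torch

def ranking_loss(preds, targets, margin=0.0):
    """preds, targets: (B,) predicted and true birth weights (grams)."""
    diff_p = preds.unsqueeze(0) - preds.unsqueeze(1)      # pairwise pred gaps
    diff_t = targets.unsqueeze(0) - targets.unsqueeze(1)  # pairwise true gaps
    sign = torch.sign(diff_t)
    # Hinge penalty whenever the predicted ordering violates the true one.
    return torch.clamp(margin - sign * diff_p, min=0)[sign != 0].mean()

preds = torch.tensor([3100.0, 2800.0, 3500.0])
targets = torch.tensor([3000.0, 2900.0, 3600.0])
print(ranking_loss(preds, targets, margin=50.0))
```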

Authors:Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
Title: Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
Abstract:
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. Specifically, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and a 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.
Chinese: The proposed Lift to Match (L2M) framework learns a 3D-aware feature encoder through multi-view synthesis and 3D Gaussian representation, then combines novel-view rendering with synthetic data to achieve cross-domain generalization, substantially improving the robustness of feature matching.
English: The proposed Lift to Match (L2M) framework enhances feature matching by first learning a 3D-aware feature encoder through multi-view synthesis and 3D Gaussian representation, then employing novel-view rendering with synthetic data to achieve robust generalization across diverse domains.

Authors:Jianhao Xie, Ziang Zhang, Zhenyu Weng, Yuesheng Zhu, Guibo Luo
Title: MedDiff-FT: Data-Efficient Diffusion Model Fine-tuning with Structural Guidance for Controllable Medical Image Synthesis
Abstract:
Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training data. While diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained due to their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we present MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. During inference, a dynamic adaptive guiding mask enforces spatial constraints to ensure anatomically coherent synthesis, while a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. Additionally, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Evaluated on five medical segmentation datasets, MedDiff-FT's synthetic image-mask pairs improve the segmentation performance of a state-of-the-art method by an average of 1% in Dice score. The framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation. The code is available at https://github.com/JianhaoXie1/MedDiff-FT.
Chinese: MedDiff-FT proposes a controllable medical image generation method that fine-tunes a diffusion foundation model to efficiently generate domain-specific medical images while preserving anatomical consistency, raising the performance of a state-of-the-art method by an average of 1% in Dice score on five segmentation datasets.
English: MedDiff-FT is a novel diffusion-based method that fine-tunes foundation models to generate high-quality medical images with anatomical coherence and domain specificity, improving segmentation performance by 1% in Dice score across five datasets while maintaining computational efficiency.
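The "mask corrosion" step reads like plain morphological erosion, shrinking the guiding mask so synthesized boundaries stay inside anatomically plausible regions; a sketch under that assumption follows, with corrode_mask and the iteration count as illustrative choices.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def corrode_mask(mask, iterations=2):
    """mask: (H, W) binary array; returns a slightly shrunken copy so that
    synthesized boundaries stay inside the guided region."""
    return binary_erosion(mask.astype(bool), iterations=iterations)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1
refined = corrode_mask(mask)
print(int(mask.sum()), int(refined.sum()))  # eroded mask covers fewer pixels
```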

Authors:Mengyi Shan, Zecheng He, Haoyu Ma, Felix Juefei-Xu, Peizhao Zhang, Tingbo Hou, Ching-Yao Chuang
Title: Populate-A-Scene: Affordance-Aware Human Video Generation
Abstract:
Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.
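To show the kind of probe a cross-attention heatmap study relies on, the sketch below captures attention weights with a PyTorch forward hook on a toy attention module. The module, token shapes, and query-averaging are stand-ins for a real video model's layers, not the paper's setup.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
maps = []

def grab(module, inputs, output):
    # With need_weights=True the output is (attn_output, attn_weights).
    maps.append(output[1].detach())

attn.register_forward_hook(grab)
img_tokens = torch.randn(1, 196, 64)   # "scene" tokens (14 x 14 patches)
txt_tokens = torch.randn(1, 8, 64)     # "prompt" tokens
attn(txt_tokens, img_tokens, img_tokens, need_weights=True)

# Average attention over prompt tokens to get one spatial heatmap.
heatmap = maps[0].mean(dim=1).reshape(1, 14, 14)
print(heatmap.shape)  # torch.Size([1, 14, 14])
```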

Authors:Chuyan Zhang, Kefan Wang, Yun Gu
Title: Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes
Abstract:
Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/SR-LoRA.
Chinese: SR-LoRA proposes a stable-rank-guided low-rank adaptation framework that exploits the intrinsic dimensionality of pre-trained weight matrices to allocate ranks intelligently across layers, achieving a better balance of performance and efficiency on tasks with significant domain gaps without extra search cost.
English: SR-LoRA introduces a stable rank-guided framework that efficiently reallocates ranks across layers using pre-trained weight matrices' intrinsic dimensionality, achieving superior performance and efficiency in domain-gap tasks without additional search costs.
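The stable rank of a weight matrix is ||W||_F^2 / sigma_max(W)^2, a smooth lower bound on its rank. The sketch below computes it per layer and splits a total LoRA rank budget proportionally; the proportional rule in allocate_ranks is an illustrative assumption rather than SR-LoRA's exact allocation scheme.

```python
import torch

def stable_rank(W):
    """||W||_F^2 / sigma_max(W)^2, a smooth lower bound on matrix rank."""
    fro2 = W.pow(2).sum()
    sigma_max = torch.linalg.matrix_norm(W, ord=2)   # spectral norm
    return (fro2 / sigma_max.pow(2)).item()

def allocate_ranks(weights, total_rank):
    """Split a total LoRA rank budget proportionally to stable ranks."""
    srs = [stable_rank(W) for W in weights]
    total = sum(srs)
    return [max(1, round(total_rank * s / total)) for s in srs]

layers = [torch.randn(512, 512), torch.randn(1024, 512), torch.randn(512, 256)]
print(allocate_ranks(layers, total_rank=48))
```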

Authors:Zhuangzhuang Dai, Vincent Gbouna Zakka, Luis J. Manso, Chen Li
Title: GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception
Abstract:
Enabling robots to understand human gaze targets is a crucial step toward capabilities in downstream tasks, for example, attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze targets in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines: an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross-validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes it a first-of-its-kind system that predicts gaze targets from realistic camera footage while remaining highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.
Chinese: This paper proposes the GazeTarget360 system, which integrates an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder to achieve 360-degree gaze target estimation from images, producing accurate predictions in unseen scenarios and surpassing existing methods such as OpenFace.
English: This paper introduces GazeTarget360, a novel system that enables 360-degree gaze target estimation from images by integrating an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder, achieving accurate predictions in unseen scenarios and outperforming prior methods like OpenFace.

Authors:Mehmet Yigit Avci, Pedro Borges, Paul Wright, Mehmet Yigitsoy, Sebastien Ourselin, Jorge Cardoso
Title: MR-CLIP: Efficient Metadata-Guided Learning of MRI Contrast Representations
Abstract:
Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. The code and weights are publicly available at https://github.com/myigitavci/MR-CLIP.
Chinese: MR-CLIP is a multimodal framework that learns contrast-aware representations by aligning magnetic resonance images with DICOM metadata, enabling robust applications such as cross-modal retrieval and contrast classification without manual annotation.
English: MR-CLIP is a multimodal framework that aligns MRI images with DICOM metadata to learn contrast-aware representations, enabling robust applications like cross-modal retrieval and contrast classification without manual labels.
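Image-metadata alignment of this kind typically uses a CLIP-style symmetric InfoNCE objective; a minimal sketch follows. The clip_loss helper, the temperature, and the random stand-in embeddings are assumptions, since the paper's training details are not reproduced here.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, meta_emb, temperature=0.07):
    """img_emb, meta_emb: (B, D); matched pairs share the same row index."""
    img_emb = F.normalize(img_emb, dim=-1)
    meta_emb = F.normalize(meta_emb, dim=-1)
    logits = img_emb @ meta_emb.t() / temperature    # (B, B) similarity matrix
    labels = torch.arange(img_emb.shape[0])          # diagonal entries match
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

img = torch.randn(16, 256)   # image-encoder outputs (stand-in)
meta = torch.randn(16, 256)  # metadata-encoder outputs (stand-in)
print(clip_loss(img, meta))
```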